Open HripsimeS opened 2 months ago
@HripsimeS Lines 83-85 of this code transfer the extracted web data to a GCP cloud bucket, which requires your own credentials. If you don't want to use a cloud bucket, you can simply comment out these lines and run the script again; the extracted file will then be saved to your local drive. These are the lines I mean:
client = storage.Client()
bucket = client.get_bucket(CFG.bucket)
bucket.blob(CFG.path+'/'+filename).upload_from_string(df.to_csv(), 'text/csv')
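As an alternative to commenting the upload lines in and out, the upload could be gated behind an argument. The following is only a minimal sketch: `save_output`, `bucket_name`, `bucket_path` and `local_dir` are hypothetical names that do not come from the project, and the only Cloud Storage calls used are the ones quoted above.

```python
from pathlib import Path


def save_output(df, filename, bucket_name=None, bucket_path="", local_dir="working_dir"):
    """Write df to a local CSV and upload it to a bucket only if one is named.

    save_output and its parameters are hypothetical names for this sketch;
    they are not part of the project's code.
    """
    out_dir = Path(local_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    local_file = out_dir / filename
    df.to_csv(local_file)  # the local copy is always written

    if bucket_name:
        # Imported lazily so the script still runs without the
        # google-cloud-storage package or credentials when no bucket is used.
        from google.cloud import storage

        client = storage.Client()
        bucket = client.get_bucket(bucket_name)
        bucket.blob(bucket_path + "/" + filename).upload_from_string(
            df.to_csv(), "text/csv"
        )
    return local_file
```

Calling `save_output(df, filename)` without a bucket name skips the Cloud Storage client entirely, so no credentials are looked up.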
@Alekh-sinha thank you, that fixed the credentials error. In the web_extraction.py file I modified the URL as follows, so that the full path is read:
base_url = "https://www.ibm.com"
relative_url = "/topics/large-language-models/"
URL = base_url + relative_url
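Plain string concatenation works here because the two pieces happen to meet at exactly one slash. A slightly more forgiving sketch uses `urljoin` from the standard library, which resolves the relative part against the base; whether the script needs this extra robustness is an assumption on my part.

```python
from urllib.parse import urljoin

base_url = "https://www.ibm.com"
relative_url = "/topics/large-language-models/"

# urljoin resolves the relative part against the base, so a trailing slash
# on base_url would not produce a doubled slash in the result.
URL = urljoin(base_url, relative_url)
print(URL)
```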
After executing the web_extraction.py file, I get a CSV file containing the extracted information.
Then, when I execute the rag.py file, I get an IndexError: list index out of range.
Do you have any ideas what is going wrong and why I get that issue with the URL I used? Thanks in advance!
@HripsimeS I have changed the code, and it should now work for you. The error occurs because the code cannot find the CSV file generated by web_extraction.py. I have now defined a working directory where everything is saved.
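For anyone hitting the same error: an IndexError of this kind typically appears when code indexes into an empty list of matching files. How rag.py actually locates the CSV is not shown in this thread, so the following is only a sketch with a hypothetical helper `find_csv`, assuming the file sits in a `working_dir` folder next to the script.

```python
from pathlib import Path


def find_csv(working_dir="working_dir"):
    """Return the first CSV file in working_dir, or raise a descriptive error.

    find_csv is a hypothetical helper for this sketch, not a function
    from the project.
    """
    csv_files = sorted(Path(working_dir).glob("*.csv"))
    if not csv_files:
        # Fail with a clear message instead of an IndexError on csv_files[0].
        raise FileNotFoundError(
            f"No CSV file found in {working_dir!r}; run web_extraction.py first."
        )
    return csv_files[0]
```

Raising FileNotFoundError with the directory name makes it obvious that the extraction step has to run before the retrieval step.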
@Alekh-sinha I tested the web_extraction.py script again with my URL path: https://www.ibm.com/topics/large-language-models
A working_dir folder is created, containing a CSV file with the following information.
@HripsimeS is there any error associated with it? The script is meant to extract the text from the HTML and store it in a single index.
@Alekh-sinha no, I don't get any errors. After the web_extraction.py script runs, a working_dir folder is created containing a CSV file with the information I shared above.
@Alekh-sinha Hello,
I am trying to test your project. I changed the URL path on line 67 of the web_extraction.py file, but when I run it from the command line I get the following error.
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found.
Do I still need to make more modifications to the script, or do I need to set up Application Default Credentials? If possible, could you please also send the URL path you used in the web_extraction.py file?
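One quick check before running the script is whether a credentials file has been pointed to explicitly. Application Default Credentials looks at the `GOOGLE_APPLICATION_CREDENTIALS` environment variable among other places (for example, credentials created with `gcloud auth application-default login`), so the sketch below covers only the environment-variable case. The helper name `explicit_credentials_file` is hypothetical.

```python
import os
from pathlib import Path


def explicit_credentials_file():
    """Return the file named by GOOGLE_APPLICATION_CREDENTIALS, or None.

    Hypothetical helper: it checks only the environment variable, not the
    other locations that Application Default Credentials may search.
    """
    value = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if value and Path(value).is_file():
        return Path(value)
    return None


if __name__ == "__main__":
    found = explicit_credentials_file()
    print(found if found else "GOOGLE_APPLICATION_CREDENTIALS is not set to an existing file")
```

A `None` result does not prove that credentials are missing, only that none were supplied through this variable.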