UP-LIFT / .github


Create web scraper for extracting data #2

Open · aashutosh-samagra opened this issue 8 months ago

aashutosh-samagra commented 8 months ago
  1. Illustrated Technical Book Hindi Webpage with PDFs Link

  2. Rabi Crop Details English Webpages with Text Link

rachitavya commented 8 months ago

Hello @ChakshuGautam @aashutosh-samagra 👋,

As far as I can tell from the issue:

  1. We want to download all the PDFs on the link.
  2. We want data for every rabi crop from each crop's link. This scraped data will probably be stored in JSON format.

Correct me if I am wrong.

I can do the task using bs4 in Python by iterating over every crop's link from the page source.
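
For reference, a minimal sketch of what this could look like with requests and bs4; the URL and the link filters are placeholders, since the actual pages from the issue would need to be inspected first:

```python
# Sketch: collect links from a listing page, download PDFs, and visit crop pages.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://example.org/rabi-crops"  # placeholder for the real listing page

resp = requests.get(PAGE_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

for a in soup.find_all("a", href=True):
    href = urljoin(PAGE_URL, a["href"])
    if href.lower().endswith(".pdf"):
        # Task 1: download every linked PDF.
        pdf = requests.get(href, timeout=60)
        with open(os.path.basename(href), "wb") as f:
            f.write(pdf.content)
    elif "crop" in href:  # placeholder filter for crop detail pages
        # Task 2: visit each crop's page and extract its text/tables.
        crop_page = requests.get(href, timeout=30)
        crop_soup = BeautifulSoup(crop_page.text, "html.parser")
        # ... pull out the crop details (e.g. to store as JSON) here ...
```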

ChakshuGautam commented 8 months ago

@rachitavya your assumptions are correct. We are looking for a single-file Python script with minimal dependencies to do this. bs4 should work.

rachitavya commented 8 months ago

I can do it by end of the day tomorrow. Kindly assign me. @ChakshuGautam

ChakshuGautam commented 8 months ago

Hey @rachitavya how is it going?

rachitavya commented 8 months ago

Hello @ChakshuGautam

rachitavya commented 8 months ago

Hey @ChakshuGautam, there are multiple types of tables on the web pages, with different dimensions and no particular pattern. I am a bit confused about how to write one generic piece of code for such a variety of tables.

ChakshuGautam commented 8 months ago

Got it. If there is no pattern, there's no need to generalize. Let's feed that to an LLM and ask it to parse it.
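
For illustration, a rough sketch of that idea using the openai package's newer client interface; the model name and prompt are assumptions, not necessarily what was actually used here:

```python
# Sketch: hand one raw HTML table to the model and ask for JSON back.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def parse_table_with_llm(table_html: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You convert HTML tables into JSON. Reply with JSON only."},
            {"role": "user",
             "content": f"Parse this table into JSON:\n\n{table_html}"},
        ],
    )
    return resp.choices[0].message.content  # assumed to be JSON text
```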

rachitavya commented 8 months ago

Understood. This will be done by tomorrow. 👍

rachitavya commented 7 months ago

Hello @ChakshuGautam 👋🏻

The tasks are done and I want to mention a couple of constraints:

On the other hand, I also tried a pure coding approach, but it only works on tables that have a proper (non-complex) structure.

ChakshuGautam commented 7 months ago

Can you share the repo here? Let me also get into this.

rachitavya commented 7 months ago

Here it is: https://github.com/rachitavya/.github

singhalkarun commented 7 months ago

Hey @rachitavya,

  1. Can we please move the code to a separate repository? You should not use the .github repository for this.
  2. I see you have mentioned there is some constraint associated with the API key for ChatGPT. Can you please describe it? You can find steps here to generate and use an API key.
  3. Can you please detail how many different types of tables there are? We don't need to write generic code. If ChatGPT fails to help us, we can write different code for different structures instead of trying to generalise them; our prime goal is to get the data, and it's a one-time thing for now.
  4. Also, we are expecting the output in the form of PDF files.

rachitavya commented 7 months ago

Hey @singhalkarun

  1. Will be done.
  2. The constraint was just the limited free credits, nothing else. Also, the key can't be pushed to GitHub.
  3. ChatGPT is helping us with all the cases as of now. It is just taking time, which I mentioned; otherwise it is working smoothly. There's only one table that is too large for ChatGPT to process; beyond that there's no limit. Types of tables: for the simple column tables, the code is ready and works smoothly, but some tables have a first row for columns and a second row for sub-columns (number of sub-columns > number of columns). For the latter, implementing code was getting difficult (see the sketch after this list).
  4. Not JSON? Putting HTML tables into a PDF might be an easier task than putting them into JSON.
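
If it helps, here is a minimal sketch of one way to handle the "columns plus sub-columns" case without an LLM, by letting pandas build a two-level header; the file name is a placeholder and this is not the script in the repo:

```python
# Sketch: parse a table whose first row holds columns and whose second
# row holds sub-columns by reading both rows as a MultiIndex header.
import pandas as pd

with open("table.html", encoding="utf-8") as f:  # placeholder file with one extracted <table>
    tables = pd.read_html(f, header=[0, 1])
df = tables[0]

# Flatten the two header levels so the result can be serialised to JSON.
df.columns = [" / ".join(str(part) for part in col).strip() for col in df.columns]
print(df.to_json(orient="records"))
```
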
singhalkarun commented 7 months ago

> The constraint was just the limited free credits, nothing else. Also, the key can't be pushed to GitHub.

Can we please pick the key from an environment file (a .env file) and push the code that reads it from there?
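
For illustration, a minimal sketch of that approach, assuming the python-dotenv package and an OPENAI_API_KEY entry in a local, uncommitted .env file (the names are assumptions, not necessarily what the repo uses):

```python
# Sketch: keep the API key out of the repository and load it at runtime.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is missing; add it to a local .env file")
```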

> ChatGPT is helping us with all the cases as of now. It is just taking time, which I mentioned; otherwise it is working smoothly. There's only one table that is too large for ChatGPT to process; beyond that there's no limit.

No worries about that. We can use a paid account, as it is a one-time activity. Also, we can use GPT-4 if that speeds up the process. @ChakshuGautam that will be fine, right?

> For the simple column tables, the code is ready and works smoothly, but some tables have a first row for columns and a second row for sub-columns (number of sub-columns > number of columns). For the latter, implementing code was getting difficult.

Great. Maybe use code for the tables where that is possible, and for the ones where it fails, let's use ChatGPT. Also, can I know what percentage of the whole these tables are? If it is not too many, we can get them done manually.

> Not JSON? Putting HTML tables into a PDF might be an easier task than putting them into JSON.

Yeah, we need PDF as the end output. @ChakshuGautam, can you please add more detail on this if I am missing anything?

rachitavya commented 7 months ago

I am already using that .env approach, so consider it done. Cool, I'll be creating scripts for PDF output now.
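
As a rough illustration of the PDF step, here is a short sketch assuming reportlab and the output/ folder requested later in the thread; the function name and data layout are made up for the example:

```python
# Sketch: write one crop's scraped table to output/<crop>.pdf with reportlab.
import os

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Table

def save_crop_pdf(crop_name: str, rows: list[list[str]]) -> None:
    os.makedirs("output", exist_ok=True)  # PDFs go into output/ at the repo root
    doc = SimpleDocTemplate(os.path.join("output", f"{crop_name}.pdf"), pagesize=A4)
    styles = getSampleStyleSheet()
    doc.build([Paragraph(crop_name, styles["Title"]), Table(rows)])

# Example call with dummy data:
# save_crop_pdf("wheat", [["Stage", "Irrigation"], ["Sowing", "Pre-sowing"]])
```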

singhalkarun commented 7 months ago

> I am already using that .env approach, so consider it done. Cool, I'll be creating scripts for PDF output now.

Thanks. Please share the repository link where you are maintaining the code (please switch to a newer repository if you are still using .github). Also, please add steps to the README on how to run it on my system. Please ensure that the PDFs generated by the script get saved inside a folder called output in the root directory of the repository.

Let's try to close this by tomorrow (at least a v1; it's fine if the complex tables get missed, we can solve that once we have v1). Feel free to reach out for any other concern.

rachitavya commented 7 months ago

Here's the repo: https://github.com/rachitavya/crops_webpages_scrapper

singhalkarun commented 7 months ago

> I am already using that .env approach, so consider it done.

Hey @rachitavya, can you please add a sample.env file to the repository that specifies which environment variables have to be set?

rachitavya commented 7 months ago

Missed it previously, done now.

rachitavya commented 7 months ago

Hey @singhalkarun @ChakshuGautam, the task is done now and v1 is ready. You can check the README.md file for how to run it.

Outputs are now getting saved in PDF format for each crop. Additionally, JSON output is also there. No ChatGPT was required for the PDF-based output scraper.

ChakshuGautam commented 7 months ago

Can you use Git LFS to store the output PDFs in the repo too?

rachitavya commented 7 months ago

I think they can be uploaded directly without LFS because the total file size is less than 1 MB. If you still want me to use LFS, I can do that too.

Ohh, I got it. For the PDF books, i.e. the first point in this issue, the files are large, and I will have to use LFS there. Will do.

rachitavya commented 7 months ago

I've uploaded the rabi_crops pdf in the repo itself.

rachitavya commented 7 months ago

Hey @ChakshuGautam @singhalkarun, all the files (PDFs) have been uploaded to the repo, and the README also has instructions for running the scraper if needed. Please let me know if you need any assistance from my side.