aashutosh-samagra opened this issue 8 months ago
Hello @ChakshuGautam @aashutosh-samagra 👋,
From what I understand of the issue (correct me if I am wrong):
I can do the task using bs4 in Python by iterating over every crop's link from the page source.
@rachitavya your assumptions are correct. We are looking at a single file python script with minimal dependencies to do this. bs4 should work.
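A minimal sketch of that bs4 approach — collect every crop link from the index page, then fetch each one. The page URL and markup are assumptions here, so a small inline HTML sample stands in for the real page source:

```python
# Sketch of the single-file scraper: find every crop link in the page
# source with bs4. The markup below is a made-up stand-in for the real
# index page; the href prefix is an assumption.
from bs4 import BeautifulSoup

SAMPLE_INDEX = """
<html><body>
  <ul class="crop-list">
    <li><a href="/crops/wheat.html">Wheat</a></li>
    <li><a href="/crops/mustard.html">Mustard</a></li>
    <li><a href="/about.html">About</a></li>
  </ul>
</body></html>
"""

def crop_links(html, prefix="/crops/"):
    """Return (name, href) for every anchor that looks like a crop page."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)
            if a["href"].startswith(prefix)]

print(crop_links(SAMPLE_INDEX))
```

Each returned link would then be fetched (urllib or requests) and its tables parsed in the same script.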
I can do it by end of the day tomorrow. Kindly assign me. @ChakshuGautam
Hey @rachitavya how is it going?
Hey @ChakshuGautam, the web pages contain multiple types of tables, with different dimensions and no particular pattern. I am not sure how to write one generic parser for such a variety of tables.
Got it. If there is no pattern, we don't need to generalize. Let's feed those tables to an LLM and ask it to parse them.
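That LLM step could look roughly like this — build a chat payload that hands the raw table HTML to the model. The model name and instruction wording are assumptions, and the network call itself is left out so the sketch stays runnable offline:

```python
# Sketch of the "feed the table to an LLM" idea: instead of a generic
# parser, send the messy table HTML as a chat message. Model name and
# prompt text are assumptions, not the actual script's values.
import json

def table_to_messages(table_html):
    return [
        {"role": "system",
         "content": "You convert messy HTML tables to JSON. "
                    "Reply with a JSON array of row objects only."},
        {"role": "user", "content": table_html},
    ]

payload = {
    "model": "gpt-3.5-turbo",  # assumed; GPT-4 was discussed as a speed-up
    "messages": table_to_messages("<table><tr><td>Wheat</td></tr></table>"),
}
print(json.dumps(payload)[:80])
```

The actual request would go through the OpenAI client with the key loaded from the environment.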
Understood. This will be done by tomorrow. 👍
Hello @ChakshuGautam 👋🏻
The tasks are done and I want to mention a couple of constraints:
On the other hand, I also tried a pure-code approach, but it works only on tables with simple (non-complex) structures.
Can you share the repo here? Let me also take a look at this.
Here it is: https://github.com/rachitavya/.github
Hey @rachitavya,
Hey @singhalkarun
- The constraint was just the limited free API credits, nothing else. Also, the key can't be pushed to GitHub.
Can we please pick the key from an environment file (.env file) and push the code that picks it from the env file?
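For reference, here is a minimal sketch of that .env approach using only the standard library (python-dotenv would do the same thing). The variable name `OPENAI_API_KEY` is an assumption:

```python
# Minimal .env loader sketch, stdlib only. Reads KEY=VALUE lines into
# os.environ so the committed code never contains the key itself.
import os

def load_env(path=".env"):
    """Load KEY=VALUE pairs from a .env file, skipping comments/blanks."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# usage (variable name is an assumption):
# load_env()
# api_key = os.environ["OPENAI_API_KEY"]
```

The .env file itself goes into .gitignore, and a sample.env documents which variables are expected.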
- ChatGPT is handling all the cases as of now. It is just slow, as I mentioned; otherwise it works smoothly. There is only one table that is too large for ChatGPT to process; beyond that there is no other limit.
No worries about that. We can use a paid account, as it is a one-time activity. Also, we can use GPT-4 if it speeds up the process. @ChakshuGautam that will be fine, right?
For the simple-column tables, the code is ready and works smoothly. But some tables use the first row for columns and the second row for sub-columns (more sub-columns than columns). Writing code for the latter was getting difficult.
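One possible way to code around those two-row headers — flatten the parent row (whose colspans cover the sub-columns) into combined column names. The markup below is a made-up example of that shape, not one of the actual crop tables:

```python
# Sketch: flatten a two-row table header (parent columns with colspan
# over a second row of sub-columns) into single combined header names.
from bs4 import BeautifulSoup

SAMPLE = """
<table>
  <tr><th rowspan="2">Crop</th><th colspan="2">Yield</th></tr>
  <tr><th>Min</th><th>Max</th></tr>
  <tr><td>Wheat</td><td>30</td><td>45</td></tr>
</table>
"""

def flat_headers(table_html):
    rows = BeautifulSoup(table_html, "html.parser").find_all("tr")
    top, sub = rows[0].find_all("th"), rows[1].find_all("th")
    headers, sub_i = [], 0
    for th in top:
        span = int(th.get("colspan", 1))
        if span == 1 and int(th.get("rowspan", 1)) > 1:
            headers.append(th.get_text(strip=True))  # spans both header rows
        else:
            for _ in range(span):  # pair parent with each sub-column
                headers.append(f"{th.get_text(strip=True)} / "
                               f"{sub[sub_i].get_text(strip=True)}")
                sub_i += 1
    return headers

print(flat_headers(SAMPLE))
```

This only covers the colspan/rowspan shape described above; truly irregular tables would still go to the LLM fallback.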
Great. Use code for the tables where that's possible, and fall back to ChatGPT where it fails. Also, can I know what percentage of the total these complex tables make up? If it is not too many, we can do them manually.
- Not JSON? Putting HTML tables into a PDF might be an easier task than converting them to JSON.
Yeah, we need PDF as the end output. @ChakshuGautam, can you please add more detail on this if I am missing anything?
I am already using that .env approach. Consider it done. Cool, I'll be creating scripts for PDF output now.
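A rough sketch of that PDF step — render each crop's tables into a small HTML document, then hand it to an HTML-to-PDF converter. pdfkit (which wraps wkhtmltopdf) is one assumed option, not necessarily what the repo uses; the call is guarded so the HTML-building part stays runnable without it:

```python
# Sketch of the PDF output step: wrap a crop's tables in an HTML page
# and convert it. pdfkit is an assumed converter choice; the "output"
# folder matches the requirement discussed in this thread.
import os

def crop_html(name, tables_html):
    body = "".join(tables_html)
    return f"<html><body><h1>{name}</h1>{body}</body></html>"

def save_pdf(name, tables_html, out_dir="output"):
    os.makedirs(out_dir, exist_ok=True)
    html = crop_html(name, tables_html)
    try:
        import pdfkit  # assumed; needs the wkhtmltopdf binary installed
        pdfkit.from_string(html, os.path.join(out_dir, f"{name}.pdf"))
    except Exception:
        pass  # converter or binary not available; HTML is still returned
    return html

doc = save_pdf("wheat", ["<table><tr><td>Rabi</td></tr></table>"])
```

Saving one PDF per crop into `output/` in the repo root keeps it in line with what was asked below.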
Thanks. Please share the link to the repository where you are maintaining the code (please switch to a new repository if you are still using .github). Also, please add steps to the README on how to run it on my system, and ensure the PDFs generated by the script get saved inside a folder called output in the root directory of the repository.
Let's try to close this by tomorrow (at least a v1; it's fine if the complex tables get missed, let's solve that once we have a v1). Feel free to reach out with any other concern.
Here's the repo: https://github.com/rachitavya/crops_webpages_scrapper
Hey @rachitavya, can you please add a sample.env file to the repository which specifies what environment variables have to be set?
Missed it previously, done now.
Hey @singhalkarun @ChakshuGautam The task is done now and v1 is ready. You can check the README.md file for how to run it.
Outputs are now saved in PDF format for each crop, and JSON output is also available. No ChatGPT was required for the PDF-based scraper.
Can you use Git LFS to store the output PDFs in the repo too?
I think they can be uploaded directly without LFS because the total file size is less than 1 MB. If you still want me to use LFS, I can do that too.
Ohh, I got it. For the PDF books, i.e. the first point in this issue, the files are large and I will have to use LFS there. Will do.
I've uploaded the rabi_crops pdf in the repo itself.
Hey @ChakshuGautam @singhalkarun All the files (PDFs) have been uploaded to the repo, and the README has instructions for running the scraper if needed. Please let me know if you guys need any assistance from my side.
- Illustrated Technical Book (Hindi webpage with PDFs): Link
- Rabi Crop Details (English webpages with text): Link