LucknowAI / Lucknow-LLM

Collecting data for Building Lucknow's first LLM
17 stars 27 forks

Build scraper to continuously update the unstructured data folder with the latest Lucknow data #38

Open monk1337 opened 7 months ago

monk1337 commented 7 months ago

Right now the unstructured data folder contains limited data. We need scrapers that pull data from different Lucknow websites, so that whenever we want to add more data or update the Lucknow database in the future, we can simply rerun those scraper agents.

thePratyakshSoni1 commented 7 months ago

I can do it, but I will need a list of websites to fetch the data from. For example, if one of them is a blogging site, then whenever we run our scraper, the new blog posts will be added to the unstructured data.
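To make the "only new blogs get added on each run" idea concrete, here is a minimal sketch. It assumes each scraped post can be identified by its URL and that a small JSON index of already-seen URLs is kept next to the data. The function name `add_new_posts` and the index file are hypothetical, not something that exists in the repo.

```python
import json
from pathlib import Path
from typing import Dict, List


def add_new_posts(posts: Dict[str, str], index_path: Path) -> List[str]:
    """Record which posts are new since the last run.

    posts maps a post URL to its scraped text (assumed shape).
    index_path is a JSON file listing URLs seen on earlier runs.
    Returns the URLs that were not seen before, in input order.
    """
    if index_path.exists():
        seen = set(json.loads(index_path.read_text(encoding="utf-8")))
    else:
        seen = set()

    # Keep only posts whose URL has not been recorded yet.
    new_urls = [url for url in posts if url not in seen]

    # Persist the updated index so the next run skips these posts.
    seen.update(new_urls)
    index_path.write_text(json.dumps(sorted(seen), indent=2), encoding="utf-8")
    return new_urls
```

Running it twice with the same posts should report them as new only the first time, so a scheduled run would add only fresh posts to the unstructured data.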

AayushSharma-1 commented 7 months ago

How about we build this scraper in parts, like someone takes the tourism part, someone takes the hospitals part, and later on, we can combine them to make a fully automated raw data scraper?
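One possible way to structure the "build in parts, combine later" approach is a small registry where each contributor plugs in their own topic scraper. This is only a sketch: the names `SCRAPERS`, `register`, and `run_all` are hypothetical, and the topic functions return placeholder strings instead of fetching real pages.

```python
from typing import Callable, Dict, List

# Hypothetical registry: topic name -> function returning scraped text records.
SCRAPERS: Dict[str, Callable[[], List[str]]] = {}


def register(topic: str):
    """Decorator that adds a topic scraper to the registry."""
    def wrap(func: Callable[[], List[str]]) -> Callable[[], List[str]]:
        SCRAPERS[topic] = func
        return func
    return wrap


@register("tourism")
def scrape_tourism() -> List[str]:
    # Placeholder: a real version would fetch and parse tourism pages.
    return ["placeholder tourism record"]


@register("hospitals")
def scrape_hospitals() -> List[str]:
    # Placeholder: a real version would fetch and parse hospital pages.
    return ["placeholder hospital record"]


def run_all() -> Dict[str, List[str]]:
    """Run every registered scraper; a failing topic does not stop the rest."""
    results: Dict[str, List[str]] = {}
    for topic, scraper in SCRAPERS.items():
        try:
            results[topic] = scraper()
        except Exception as exc:
            print(f"{topic} scraper failed: {exc}")
            results[topic] = []
    return results
```

With this layout, each person only writes one decorated function for their topic, and `run_all()` becomes the single entry point for the combined scraper.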

thePratyakshSoni1 commented 7 months ago

> How about we build this scraper in parts, like someone takes the tourism part, someone takes the hospitals part, and later on, we can combine them to make a fully automated raw data scraper?

That would be nice, but we will still need a list of sites (ones that regularly update data on a specific topic) to target for the latest data.

Or we can add another folder called scrapped inside the Unstructured_data folder and have our program scrape any Lucknow-related data into it (possibly as separate files named by date or something else).
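A rough sketch of the date-named files idea, using only the standard library. The default path `Unstructured_data/scrapped`, the `topic_YYYY-MM-DD.txt` naming scheme, and the function name `save_scraped_text` are all assumptions for illustration, not an agreed layout.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def save_scraped_text(
    text: str,
    topic: str,
    base_dir: str = "Unstructured_data/scrapped",  # assumed folder layout
    day: Optional[date] = None,
) -> Path:
    """Write scraped text to <base_dir>/<topic>_<YYYY-MM-DD>.txt and return the path."""
    day = day or date.today()
    folder = Path(base_dir)
    # Create the folder on first use; do nothing if it already exists.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{topic}_{day.isoformat()}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

Note that running it twice on the same day for the same topic overwrites that day's file; appending or adding a timestamp would be an alternative if several runs per day are expected.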

monk1337 commented 7 months ago

@pratyakshSoni1 @AayushSharma-1 That's a great idea: take one topic each and build the scraper step by step. @AayushSharma-1, you can go through the old PRs of this repo. The people contributing unstructured data also mention their source websites/links in the PR description, so we can use those websites to scrape.
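For collecting the source links from old PR descriptions, something like the sketch below could help once the description text is available (for example, copied by hand or exported). It assumes sources appear as plain `http://` or `https://` links; the function name `extract_source_urls` is hypothetical, and the regex is a simple heuristic rather than a full URL parser.

```python
import re
from typing import List

# Heuristic: a URL runs until whitespace, a bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]\"']+")


def extract_source_urls(pr_description: str) -> List[str]:
    """Return unique URLs found in a PR description, in order of appearance."""
    seen = set()
    urls: List[str] = []
    for match in URL_PATTERN.findall(pr_description):
        # Drop trailing punctuation that usually belongs to the sentence.
        url = match.rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list could serve as the starting set of sites for each topic scraper.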

AayushSharma-1 commented 7 months ago

Yes, sure!