ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Scraping n levels deep #112

Closed: rawmean closed this issue 6 months ago

rawmean commented 7 months ago

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links on example.com and scrape those pages too.

Describe alternatives you've considered
I could use BeautifulSoup to download the pages and then feed them to this library.
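For reference, a minimal sketch of that workaround: collect links breadth-first with BeautifulSoup, then feed each page to the library. The SmartScraperGraph call follows the usage pattern from the project README; the config dict, prompt, and depth are placeholder assumptions, not values from this thread.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

from scrapegraphai.graphs import SmartScraperGraph

def collect_links(start_url: str, depth: int) -> set[str]:
    """Breadth-first crawl: gather same-domain URLs up to `depth` levels deep."""
    seen, frontier = {start_url}, [start_url]
    domain = urlparse(start_url).netloc
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

graph_config = {"llm": {"model": "ollama/mistral"}}  # placeholder; see the README for real options

for url in collect_links("https://example.com", depth=2):
    result = SmartScraperGraph(
        prompt="Extract the page title and a one-line summary",
        source=url,
        config=graph_config,
    ).run()
    print(url, result)
```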

PeriniM commented 7 months ago

Hi @rawmean, we will add it to the to-do list of feature requests! It would be interesting to create a new graph for this, maybe calling it CrawlerGraph or DeepScraperGraph.

mayurdb commented 6 months ago

I'll try to take a stab at it. This is what I'm thinking:

Input: URL

  1. FetchNode
  2. ParseNode
  3. RAGNode
  4. SearchLinkNode -> Get all the links on the page
  5. (new) LinkFilterNode -> Filter the links down to the potentially relevant ones
  6. (new) RepeaterNode -> Executes the graph from the child node onwards once per input link, in parallel
  7. FetchNode
  8. ParseNode
  9. RAGNode
  10. (new) ContainsAnswerNode -> A new node type that can tell whether the current content contains the answer
  11. (new) ConditionalNode -> A new node with two children: if the parent returns true, pick child 1, else pick child 2
  12a. GenerateAnswerNode
  12b. Go to step 4 for the next level of depth

Let me know if this looks reasonable, or if you can think of a better alternative.
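A rough sketch of this control flow in plain Python. Every function here is a hypothetical stand-in for the nodes above, not existing library code; `contains_answer` and `filter_links` in particular are trivial keyword placeholders for what would really be LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def fetch_and_parse(url):
    """FetchNode + ParseNode: download the page, return (text, links)."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return soup.get_text(" ", strip=True), links

def contains_answer(text, prompt):
    """ContainsAnswerNode placeholder: a real version would ask the LLM."""
    return all(word.lower() in text.lower() for word in prompt.split())

def filter_links(links, prompt):
    """LinkFilterNode placeholder: keep links whose URL mentions a prompt word."""
    words = [w.lower() for w in prompt.split()]
    return [link for link in links if any(w in link.lower() for w in words)]

def deep_scrape(url, prompt, depth):
    """Steps 1-11: fetch, test for the answer, else recurse on filtered links."""
    text, links = fetch_and_parse(url)
    if contains_answer(text, prompt):            # ConditionalNode -> 12a
        return text
    if depth == 0:
        return None
    with ThreadPoolExecutor() as pool:           # RepeaterNode: branches run in parallel
        results = pool.map(
            lambda u: deep_scrape(u, prompt, depth - 1),
            filter_links(links, prompt),         # 12b: next level of depth
        )
    return next((r for r in results if r is not None), None)
```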

VinciGit00 commented 6 months ago

Yeah, please contact me through email (mvincig11@gmail.com)

ChrisDelClea commented 6 months ago

Sounds really interesting.

davideuler commented 2 months ago

I am looking for this feature too. There are two use cases:

1. Loop through several path levels of a website to extract information from all item pages, e.g. all shop item details, or the prices and locations of all rental houses. In this case, I can specify which paths should be processed via regular expressions (see the sketch after this list).
2. Loop through all pages of a small website. This behaves like a crawler such as Nutch, but I can specify what to extract from each page: one prompt to match the target pages, and another prompt to get the data/files from each matched page. Sometimes I need to crawl all videos/images matching a specified condition across the whole site.
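A small sketch of use case 1, filtering crawled links by a path pattern. The `/item/\d+` regex, listing URL, and extraction prompt are purely illustrative assumptions.

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

ITEM_PATH = re.compile(r"/item/\d+$")  # hypothetical pattern for item pages

def item_urls(listing_url: str) -> list[str]:
    """Walk a listing page and keep only links whose path matches ITEM_PATH."""
    soup = BeautifulSoup(requests.get(listing_url, timeout=10).text, "html.parser")
    links = (urljoin(listing_url, a["href"]) for a in soup.find_all("a", href=True))
    return sorted({u for u in links if ITEM_PATH.search(u)})

# Each matching page could then be handed to the scraper with a prompt such as
# "Extract the item name, price, and location".
for url in item_urls("https://example.com/shop"):
    print(url)
```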