Haradai / newspaper3k-haystack

An integration of the newspaper3k library with Haystack.
MIT License

Scrape sub pages too #2

Closed by TuanaCelik 4 months ago

TuanaCelik commented 1 year ago

Hey @Haradai, thank you so much for this integration 🙏 I was trying to use it for a demo, and I wanted to scrape the pages at https://docs.haystack.deepset.ai. Is there a way I could scrape beyond this URL? Could I also scrape subpages like https://docs.haystack.deepset.ai/docs/document_store and so on?

Haradai commented 1 year ago

Oh wow, so cool that you want to use it in a demo! Glad it can help. :)

To the first question: you should be able to restrict crawling to URLs under https://docs.haystack.deepset.ai/ by using a filter string like "docs.haystack.deepset.ai". To actually reach the subpages, the crawler node could get to many of them just by using this filter with the homepage as the initial query. That won't guarantee you reach every page, though: pages have to be linked to each other, and the parser has to find all of those links.
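
Very roughly, something like this (a sketch only; `Crawler`, `filter_str`, and `max_pages` are placeholder names, so check them against the node's actual signature):

```python
# Placeholder names throughout; the idea: start at the homepage and
# only follow links whose URL contains the filter string.
crawler = Crawler(filter_str="docs.haystack.deepset.ai", max_pages=500)
result = crawler.run(query=["https://docs.haystack.deepset.ai/"])
```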

What would probably work better is the sitemap.xml file I found when looking at the site's robots.txt. You should be able to load it in Python and then pass all the URLs as a list to the scraper node using run_batch. (I'm not sure sitemap.xml is a standard format on all websites; maybe it would be a cool addition to the scraper node?)
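
Roughly like this (a minimal sketch; the `run_batch` call at the end is an assumption to check against the scraper node's actual interface):

```python
import requests
import xml.etree.ElementTree as ET

# Standard sitemap namespace (see https://www.sitemaps.org/protocol.html)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

resp = requests.get("https://docs.haystack.deepset.ai/sitemap.xml", timeout=30)
root = ET.fromstring(resp.content)

# Collect every <loc> entry; for a sitemap *index* you'd recurse into
# each child sitemap instead.
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Then hand the whole list to the scraper node in one go, e.g.:
# scraper.run_batch(queries=urls)
```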

If for some reason the sitemap.xml doesn't contain all the URLs on the site, you could even try setting up a crawler node with query=[""] and the "docs.haystack.deepset.ai" filter, then set crawler.stack to the list of sitemap links and let it run until it has no more links to find (so set the number of pages to a big number).
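
A sketch of that fallback (again, `Crawler` and its parameter names are placeholders; only the query=[""] / filter / stack idea follows what I described above):

```python
# Seed the crawler's internal stack with the sitemap URLs so it can
# still discover any pages the sitemap missed.
crawler = Crawler(filter_str="docs.haystack.deepset.ai", max_pages=10_000)
crawler.stack = urls               # the list parsed from sitemap.xml above
result = crawler.run(query=[""])   # empty query: just drain the stack
```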

Hope it helps! If you have any more questions or find bugs, I'm happy to help!