Haradai / newspaper3k-haystack

An integration of the newspaper3k library with Haystack.
MIT License

Scrape sub pages too #2

Closed by TuanaCelik 4 months ago

TuanaCelik commented 1 year ago

Hey @Haradai, thank you so much for this integration 🙏 I was trying to use it for a demo, and I wanted to scrape the pages at https://docs.haystack.deepset.ai. Is there a way I could scrape beyond this URL? Could I also scrape subpages like https://docs.haystack.deepset.ai/docs/document_store and so on?

Haradai commented 1 year ago

Oh wow, so cool that you want to use it in a demo! Glad it can help. :)

To the first question: you should be able to restrict crawling to URLs under https://docs.haystack.deepset.ai/ by using a filter string like "docs.haystack.deepset.ai". To actually reach the subpages, the crawler node could get to many of them just by using this filter with the homepage as the initial query. That won't guarantee you reach every page, though: pages have to be linked to each other, and the parser has to find all of those links.
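
Very roughly, something like this (a sketch only; `Crawler`, `filter_str`, and `max_pages` are placeholder names, so check them against the node's actual signature):

```python
# Placeholder names throughout; the idea: start at the homepage and
# only follow links whose URL contains the filter string.
crawler = Crawler(filter_str="docs.haystack.deepset.ai", max_pages=500)
result = crawler.run(query=["https://docs.haystack.deepset.ai/"])
```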

What would probably work better is the sitemap.xml file I found when looking at the site's robots.txt. You should be able to load it in Python and then pass all the URLs as a list to the scraper node using run_batch. (I'm not sure sitemap.xml is a standard format on all websites; maybe it would be a cool addition to the scraper node?)
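
Roughly like this (a minimal sketch; the `run_batch` call at the end is an assumption to check against the scraper node's actual interface):

```python
import requests
import xml.etree.ElementTree as ET

# Standard sitemap namespace (see https://www.sitemaps.org/protocol.html)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

resp = requests.get("https://docs.haystack.deepset.ai/sitemap.xml", timeout=30)
root = ET.fromstring(resp.content)

# Collect every <loc> entry; for a sitemap *index* you'd recurse into
# each child sitemap instead.
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Then hand the whole list to the scraper node in one go, e.g.:
# scraper.run_batch(queries=urls)
```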

If for some reason the sitemap.xml doesn't contain all the URLs on the site, you could even try setting up a crawler node with query=[""] and the "docs.haystack.deepset.ai" filter, then set crawler.stack to the list of sitemap links and let it run until it has no more links to find (so set the number of pages to a big number).
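
A sketch of that fallback (again, `Crawler` and its parameter names are placeholders; only the query=[""] / filter / stack idea follows what I described above):

```python
# Seed the crawler's internal stack with the sitemap URLs so it can
# still discover any pages the sitemap missed.
crawler = Crawler(filter_str="docs.haystack.deepset.ai", max_pages=10_000)
crawler.stack = urls               # the list parsed from sitemap.xml above
result = crawler.run(query=[""])   # empty query: just drain the stack
```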

Hope it helps! If you have any more questions or find bugs, I'm happy to help!