ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

scrape a page more than once #120

Open notslang opened 6 years ago

notslang commented 6 years ago

It would be nice to have a way to grab a page more than once within a single crawl. Some sites (like search engines) don't have many links between their pages, but will display recent search queries or recently indexed content on their home page. By hitting pages like that multiple times, you can find new URLs to follow and add to the queue, allowing you to grab a much larger portion of the site than you would if you only hit that page once.

Simply having a tool like "gs-add-urls" that could add a URL to the bottom of the queue would be enough - then I could run it whenever the queue starts to look empty.
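
A minimal sketch of what such a gs-add-urls helper might look like, assuming the crawl's queue lives in the SQLite file (wpull.db) inside the crawl directory and has a `urls` table with `url` and `status` columns; those table and column names are illustrative assumptions, not the actual wpull/grab-site schema:

```python
#!/usr/bin/env python3
"""Hypothetical gs-add-urls: append extra URLs to a running crawl's queue.

This is only a sketch of the idea, not an existing grab-site tool. It assumes
the queue is a SQLite database (wpull.db) with a `urls` table holding
(url TEXT, status TEXT); the real schema may differ.
"""
import sqlite3
import sys


def add_urls(db_path, urls):
    # Mark new rows as 'todo' (an assumed status value) so they land at the
    # back of the queue and get fetched after everything already queued.
    conn = sqlite3.connect(db_path)
    with conn:
        for url in urls:
            conn.execute(
                "INSERT OR IGNORE INTO urls (url, status) VALUES (?, 'todo')",
                (url,),
            )
    conn.close()


if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit("usage: gs-add-urls <crawl-dir>/wpull.db <url> [<url> ...]")
    add_urls(sys.argv[1], sys.argv[2:])
```

Used against a crawl directory, that would look something like `gs-add-urls mycrawl/wpull.db https://example.com/recent`, rerun whenever the queue starts to look empty.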

I know that I could just start a new crawl with the same start URL, but then it'll have to revisit all the URLs that the previous crawl did, and for large sites this would take a long time and create a lot of duplication.

This type of functionality would also be useful for archiving RSS feeds that update regularly.

raspher commented 4 years ago

How would you prevent infinite crawling?

TheTechRobo commented 3 years ago

This would be great. Not much code should be required, either, since from the looks of things most of it could be reused from the existing code that automatically adds URLs to the queue. Unfortunately I don't understand the codebase well enough to help, but it shouldn't be too hard.

TheTechRobo commented 2 years ago

Hang on...wouldn't this conflict with the dupespotter?
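
For context, the dupespotter roughly works by skipping pages whose response body duplicates one already seen, so a deliberately re-fetched page whose content hasn't changed could be flagged and its links ignored. A hypothetical sketch of that kind of body-hash check (not grab-site's actual dupespotter code):

```python
import hashlib

seen_digests = set()


def looks_like_duplicate(body: bytes) -> bool:
    """Return True if an identical body was already seen during this crawl.

    Illustrates the potential conflict: if a re-fetched page has not changed,
    a body-hash check like this flags it as a duplicate, so its links may be
    skipped. Purely illustrative; not grab-site's dupespotter implementation.
    """
    digest = hashlib.sha1(body).digest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False


if __name__ == "__main__":
    print(looks_like_duplicate(b"<html>recent queries</html>"))  # False: first visit
    print(looks_like_duplicate(b"<html>recent queries</html>"))  # True: unchanged re-fetch
```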