ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

infinite recursion on offsite links? #194

Open TheTechRobo opened 2 years ago

TheTechRobo commented 2 years ago

How would I go about enabling that?

acrois commented 2 years ago

How deep do you really want to go?

Ideally, a middle ground would be to support a configurable crawl depth, to avoid finding every page on the internet.

Unless that's your thing... You can try to use it as is; from what the documentation says, it seems like it should do that. But I've never considered that a reasonable thing for a single process to be responsible for, and I haven't experimented much beyond basic/plaintext sites.

Personally, I always run with --no-offsite-links (avoid following links to other domains, even at a depth of 1). It will still crawl immediate page requisites, but it won't follow any links found past that. Then I set up a full crawl of the site, read the index for off-site URLs, take that list, and divide those sites up into separate crawls. You could call it a system.
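
Concretely, that workflow might look something like the following shell sketch. The crawl-directory glob and the idea of grepping URLs out of wpull.log are assumptions for illustration, not documented grab-site behavior, and example.com is a placeholder:

```sh
# Crawl one site; off-site links are not followed, but page requisites still are.
grab-site --no-offsite-links https://example.com/

# Collect off-site URLs seen during the crawl (directory name and log
# format are assumptions here), dropping the host we already crawled:
grep -ohE 'https?://[^ "]+' example.com-*/wpull.log \
    | grep -v '://example\.com' | sort -u > offsite-urls.txt

# Start a separate crawl for each discovered site:
while read -r url; do
    grab-site --no-offsite-links "$url"
done < offsite-urls.txt
```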

What did you do? What should happen? What happened?

TheTechRobo commented 2 years ago

I never really found a solution. It isn't a much-needed feature for me, really; it would just be nice to have a configurable depth, including "inf" for infinite.
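
For what it's worth, grab-site's documented --level=N option (default inf) already caps the overall recursion depth; what's missing is a separate depth setting for off-site hosts. A minimal sketch:

```sh
# --level caps recursion globally; there's no separate off-site depth,
# which is the part asked about in this issue.
grab-site --level=3 https://example.com/
```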

JustAnotherArchivist commented 2 years ago

The depth is infinite by default, but grab-site hardcodes the --span-hosts-allow wpull option, which prevents recursion on off-site pages. So you need to reset that to the default empty value. Maybe --wpull-args='--span-hosts --span-hosts-allow ""' would do the trick. I'm not sure whether there are further reasons the recursion might still be prevented, though.
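
Spelled out as a full (untested, as noted) invocation, with example.com as a placeholder:

```sh
# Untested: pass wpull --span-hosts and reset the hardcoded
# --span-hosts-allow back to an empty value so off-site pages recurse too.
grab-site --wpull-args='--span-hosts --span-hosts-allow ""' https://example.com/
```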