ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Errors on initial URLs are retried forever #154

Closed: JustAnotherArchivist closed this issue 5 years ago

JustAnotherArchivist commented 5 years ago

This is similar to #129 but broader in scope.

If a job is started for a URL that fails repeatedly, e.g. a nonexistent domain or a host that's currently timing out, grab-site never stops retrying it.

The problem lies here:

https://github.com/ArchiveTeam/grab-site/blob/5e75c56a7d6ee405083b2f0c3534d67b2208edd8/libgrabsite/wpull_hooks.py#L355-L357

This should instead be `return verdict`. The `accept_url` hook is always called, even when wpull's internal filters have already decided that a URL shouldn't be grabbed. In this case, that decision would come from the `TriesFilter`, which matches once a URL has been tried three times (grab-site passes `--tries 3` to wpull). But the hook always returns `True` for the initial URL, so it is retried indefinitely.
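For illustration, here is a minimal sketch of the buggy pattern and the fix, assuming wpull 1.x's `accept_url(url_info, record_info, verdict, reasons)` hook API and that `record_info['level']` is `0` for the initial URLs; this is not grab-site's exact code from `wpull_hooks.py`:

```python
# Minimal sketch of the bug, assuming wpull 1.x's hook API.
# This is an illustration, not grab-site's actual hook.

def accept_url(url_info, record_info, verdict, reasons):
    # grab-site's ignore-pattern checks would run here and may
    # return False to reject the URL; elided for brevity.

    if record_info['level'] == 0:
        # Bug: unconditionally accepting the initial URL discards
        # `verdict`, which wpull's internal filters already computed.
        # After three failed attempts, TriesFilter rejects the URL
        # (grab-site passes --tries 3), but returning True here
        # overrides that, so the URL is retried forever.
        return True
    return verdict

def accept_url_fixed(url_info, record_info, verdict, reasons):
    # Ignore-pattern checks elided, as above.

    # Fix: honor the filters' decision for initial URLs too, so
    # TriesFilter can stop the retries after --tries attempts.
    return verdict
```

With the fix, a seed URL that keeps failing is dropped once `TriesFilter` rejects it, while a seed URL that the filters accept behaves exactly as before.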