ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Failed DNS resolutions are retried forever with --wpull-args=--retry-dns-error #129

Closed: ethus3h closed this issue 5 years ago

ethus3h commented 5 years ago

Hello! I'm encountering an issue with grab-site 1.7.0 and wpull 1.2.3 on macOS 10.13.6 (17G65; Darwin 17.7.0).

To reproduce this issue, I ran:

grab-site --no-dupespotter --concurrency=3 --wpull-args='--read-timeout=3600 --connect-timeout=20 --dns-timeout=20 --retry-connrefused --retry-dns-error --max-redirect=128 --content-on-error --tries=1024' http://www.joga.com/

I stopped it with control-C after a while.

Now, its wpull.log indicates a whole lot of tries for http://www.joga.com/:

$ cat wpull.log | grep 'ERROR - Fetching ‘http://www\.joga\.com/’ encountered an error: DNS resolution failed' | wc -l 
   36890

This is a lot more than the 1024 attempts requested by --tries.

(Another crawl, which I've left running, had --tries=1024 on the command line twice; that one shows:

$ cat wpull.log | grep 'ERROR - Fetching ‘http://www\.joga\.com/’ encountered an error: DNS resolution failed' | wc -l
 2767176

)

I wonder if --retry-dns-error retries an unlimited number of times, or something?

(Or is this intended behavior / user error?)

Thanks!

ivan commented 5 years ago

Definitely not intended, but wpull (tested: 1.2.3) does appear to retry forever because of the logic in handle_error in wpull/processor/rule.py. It calls url_item.set_status(Status.error) and presumably keeps pulling the item back out of the queue forever.
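A minimal sketch of that suspected failure mode, assuming the behavior described above (illustrative Python only, not wpull's actual code; Status, URLItem, handle_error, and max_tries here are hypothetical stand-ins): if the error handler only flips the item back to an errored-but-retryable state and nothing compares a per-item try count against --tries, the item stays eligible and keeps coming back out of the queue. A bounded version would look roughly like this:

from enum import Enum

class Status(Enum):
    todo = 'todo'        # not yet attempted
    error = 'error'      # failed, still eligible for another attempt
    skipped = 'skipped'  # given up permanently

class URLItem:
    """Hypothetical stand-in for a queued URL record."""
    def __init__(self, url):
        self.url = url
        self.status = Status.todo
        self.try_count = 0

def handle_error(url_item, max_tries):
    """Bounded retry: stop once the item has been tried max_tries times.

    The suspected bug is the absence of this try_count comparison, so
    every failure puts the item straight back into a retryable state.
    """
    url_item.try_count += 1
    if url_item.try_count >= max_tries:
        url_item.status = Status.skipped   # never retried again
    else:
        url_item.status = Status.error     # retried on a later pass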

ethus3h commented 5 years ago

Hm, looks like the --retry-connrefused ones do this too.

$ cat wpull.log | grep 'ERROR - Fetching ‘https://discography\.drawnofdream\.com/’ encountered an error: \[Errno 61\] Connection refused' | wc -l
 1849050
ethus3h commented 5 years ago

I retried without the --retry-connrefused and --retry-dns-error arguments, so my command line now looks like:

grab-site --no-dupespotter --concurrency=3 --wpull-args='--read-timeout=3600 --connect-timeout=20 --dns-timeout=20 --max-redirect=128 --content-on-error --tries=1024' --1 

Even so, I'm still hitting this issue, just with different error types.

$ cat wpull.log | grep 'ERROR - Fetching ‘https://drawnofdream\.com/’ encountered an error: \[Errno 8\] Exec format error' | wc -l
 1010899
$ cat wpull.log | grep 'ERROR - Fetching ‘https://www.marinkavandam\.com/’ encountered an error: \[Errno 1\] Operation not permitted' | wc -l
  582183
$ cat wpull.log | grep 'ERROR - Fetching ‘https://db\.starsam80\.net/spotify_test2\.raw’ encountered an error: Connect timed out\.' | wc -l
   19730
$ cat wpull.log | grep 'ERROR - Fetching ‘http://starsam80\.net/’ encountered an error: Connection closed.' | wc -l
  389282

Ignoring the problematic URLs doesn't help: it seems the ignores file is not re-checked for changes between tries. (At 17:01, I changed the ignores file, leaving it as:

$ cat starsam80.net-2018-08-17-27edbbdd/ignores 
^https://archive.org/download/
^http://starsam80\.net/

but it's still trying to load http://starsam80.net/ at 17:13.)

This issue only seems to affect unfetchable URLs provided on the command line. When they're passed via an input file (-i), wpull gives up after the requested number of tries, so that can be used as a workaround.
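For example, a rough sketch of that workaround (urls.txt is just an illustrative filename; the other arguments mirror the command above):

$ echo 'https://drawnofdream.com/' > urls.txt
$ grab-site --no-dupespotter --concurrency=3 --wpull-args='--read-timeout=3600 --connect-timeout=20 --dns-timeout=20 --max-redirect=128 --content-on-error --tries=1024' -i urls.txt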

ivan commented 5 years ago

Just to comment on the last part: commit ca8fd22c02885e8e3dfce20b609daaf1dae68e48 changed the ignore behavior so that ignores are not applied to URLs given on the command line, as part of a way to crawl a tumblr without hitting the homepages of other tumblrs.