Closed: ethus3h closed this issue 5 years ago.
Not intended, I'm sure, but wpull (tested: 1.2.3) does appear to retry forever because of some logic in handle_error in wpull/processor/rule.py. It calls url_item.set_status(Status.error) and then, I guess, keeps pulling the item out of the queue forever.
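If that guess is right, the effect can be illustrated with a simplified sketch. This is not wpull's actual code; Status, URLItem, and crawl() here are stand-ins meant only to show how re-queueing an errored item without checking its try count against --tries produces unbounded retries:

```python
from collections import deque
from enum import Enum

class Status(Enum):
    todo = 'todo'
    error = 'error'
    done = 'done'

class URLItem:
    def __init__(self, url):
        self.url = url
        self.try_count = 0
        self.status = Status.todo

    def set_status(self, status):
        self.status = status

def crawl(items, max_tries, fetch, check_tries=False):
    # The simulation is capped at 10 * max_tries iterations so the
    # buggy variant terminates; a real crawl would spin forever.
    queue = deque(items)
    attempts = 0
    while queue and attempts < 10 * max_tries:
        item = queue.popleft()
        item.try_count += 1
        attempts += 1
        if fetch(item.url):
            item.set_status(Status.done)
        else:
            item.set_status(Status.error)
            # The suspected bug: the item is re-queued without ever
            # comparing try_count to max_tries.
            if check_tries and item.try_count >= max_tries:
                continue
            queue.append(item)
    return attempts

always_fail = lambda url: False
print(crawl([URLItem('http://example.com/')], 4, always_fail))
# -> 40: only the simulation cap stops it, not the try limit
print(crawl([URLItem('http://example.com/')], 4, always_fail, check_tries=True))
# -> 4: with the check, retries stop at max_tries
```

With check_tries=True the loop gives up after the requested number of tries, which is the behavior --tries=1024 asks for.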
Hm, it looks like the --retry-connrefused ones do this too.
$ cat wpull.log | grep 'ERROR - Fetching ‘https://discography\.drawnofdream\.com/’ encountered an error: \[Errno 61\] Connection refused' | wc -l
1849050
I retried without the --retry-connrefused and --retry-dns-error arguments, so my command line is now like:
grab-site --no-dupespotter --concurrency=3 --wpull-args='--read-timeout=3600 --connect-timeout=20 --dns-timeout=20 --max-redirect=128 --content-on-error --tries=1024' --1
Even so, I'm still getting this issue, just for some different error types.
$ cat wpull.log | grep 'ERROR - Fetching ‘https://drawnofdream\.com/’ encountered an error: \[Errno 8\] Exec format error' | wc -l
1010899
$ cat wpull.log | grep 'ERROR - Fetching ‘https://www.marinkavandam\.com/’ encountered an error: \[Errno 1\] Operation not permitted' | wc -l
582183
$ cat wpull.log | grep 'ERROR - Fetching ‘https://db\.starsam80\.net/spotify_test2\.raw’ encountered an error: Connect timed out\.' | wc -l
19730
$ cat wpull.log | grep 'ERROR - Fetching ‘http://starsam80\.net/’ encountered an error: Connection closed.' | wc -l
389282
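As an aside, the per-URL counts above can be tallied in one pass instead of one grep pipeline per URL. A minimal sketch, assuming the log line format shown in the excerpts ("ERROR - Fetching ‘<url>’ encountered an error: ..."):

```python
import re
from collections import Counter

# Matches the error lines quoted above; the URL sits between the
# curly quotes that wpull uses in its log output.
pattern = re.compile(r"ERROR - Fetching ‘(?P<url>[^’]+)’ encountered an error")

def count_errors(lines):
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group('url')] += 1
    return counts

sample = [
    "ERROR - Fetching ‘http://starsam80.net/’ encountered an error: Connection closed.",
    "ERROR - Fetching ‘http://starsam80.net/’ encountered an error: Connection closed.",
    "INFO - Fetched ‘http://example.com/’.",
]
print(count_errors(sample))  # Counter({'http://starsam80.net/': 2})
```

Feeding it open('wpull.log') instead of sample replaces the repeated grep | wc -l pipelines.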
Ignoring the problematic URLs doesn't do anything: it seems the ignores file is not checked for changes between tries. (At 17:01, I changed the ignores file, leaving it like:
$ cat starsam80.net-2018-08-17-27edbbdd/ignores
^https://archive.org/download/
^http://starsam80\.net/
but it was still trying to load http://starsam80.net/ at 17:13.)
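For what it's worth, the ignores entries above are anchored regexes, and the second one does match the URL that kept being retried, so the retries aren't explained by a non-matching pattern. A quick sanity check (not grab-site code):

```python
import re

# The two patterns from the ignores file quoted above, verbatim.
ignores = [r'^https://archive.org/download/', r'^http://starsam80\.net/']
url = 'http://starsam80.net/'
print(any(re.match(p, url) for p in ignores))  # True
```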
This issue only seems to affect unfetchable URLs provided on the command line. When they're passed in with an input file (-i), they give up after the requested number of retries, so that can be used as a workaround.
Just to comment on the last part: ca8fd22c02885e8e3dfce20b609daaf1dae68e48 changed the ignore behavior to not apply ignores to URLs given on the command line, as part of a way to crawl a tumblr without hitting the homepages of other tumblrs.
Hello! I'm encountering an issue with grab-site 1.7.0 and wpull 1.2.3 in macOS 10.13.6 (17G65; Darwin 17.7.0).
For testing to reproduce this issue, I ran:
grab-site --no-dupespotter --concurrency=3 --wpull-args='--read-timeout=3600 --connect-timeout=20 --dns-timeout=20 --retry-connrefused --retry-dns-error --max-redirect=128 --content-on-error --tries=1024' http://www.joga.com/
I stopped it with control-C after a while.
Now, its wpull.log indicates a whole lot of tries for http://www.joga.com/:
This is a lot more than the 1024 attempts requested by --tries. (Another one I have, which I've left running, had --tries=1024 in the command line twice; that one shows:)
I wonder if --retry-dns-error tries an unlimited number of times, or something? (Or is this intended behavior / user error?)
Thanks!