ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

URLs are sometimes not retried correctly #507

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

I've noticed that sometimes, URLs are not retried properly. The most recent example is job 172fw8g4egszevx4i56uu06cm. One of about 1700 such URLs on that job:

$ zstdgrep -F 'https://usc.gov.mm/?q=node/66' usc.gov.mm-inf-20210314-042931-172fw-meta.warc.gz
2021-03-14 04:33:31,023 - wpull.processor.web - INFO - Fetching ‘https://usc.gov.mm/?q=node/66’.
2021-03-14 04:33:51,038 - wpull.processor.base - ERROR - Fetching ‘https://usc.gov.mm/?q=node/66’ encountered an error: Connect timed out.

This URL was attempted only once and was obviously not retrieved. Further, no ignores matching this URL were present, so it should've been retried, yet it wasn't. I've seen another example of this in the past couple of months but can't find it anymore.
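For checking other jobs for the same symptom, here is a rough sketch of how single-attempt failures could be picked out of a decompressed wpull log. It assumes only the two message formats visible in the excerpt above; `find_unretried_errors` and both regular expressions are hypothetical helpers written for illustration, not part of wpull or ArchiveBot.

```python
import re
from collections import Counter

# Patterns modelled on the two log lines quoted above (an assumption:
# other wpull messages may be formatted differently).
FETCH_RE = re.compile(r" - INFO - Fetching ‘(?P<url>.+)’\.$")
ERROR_RE = re.compile(
    r" - ERROR - Fetching ‘(?P<url>.+)’ encountered an error: (?P<reason>.*)$"
)


def find_unretried_errors(lines):
    """Return {url: last error reason} for URLs fetched exactly once that errored."""
    attempts = Counter()
    errors = {}
    for line in lines:
        line = line.rstrip("\n")
        match = FETCH_RE.search(line)
        if match:
            attempts[match.group("url")] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors[match.group("url")] = match.group("reason")
    # Keep only URLs whose single fetch attempt ended in an error.
    return {url: reason for url, reason in errors.items() if attempts[url] == 1}
```

Feeding it the lines of the job's log would list every URL in the state described here. It cannot tell whether an ignore pattern legitimately suppressed the retry, so that still needs a manual check.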

I haven't looked into this in detail yet. One thing I noticed (which may be entirely irrelevant) is that all affected URLs on that job, going by a crude check of a couple of samples from wpull2-log-extract-errors in my little-things repo, are on https://usc.gov.mm/. Note that the job was started on HTTP, and the HTTPS server on that domain is actually broken.
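The "all on one origin" observation could be checked less crudely by tallying the affected URLs by scheme and host. A minimal sketch using Python's standard `urllib.parse`; `count_by_origin` is a hypothetical helper name, and the input is assumed to be a list of affected URLs, however it was obtained.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_origin(urls):
    """Tally URLs by scheme and host, so HTTP and HTTPS are counted separately."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        counts[f"{parts.scheme}://{parts.netloc}/"] += 1
    return counts
```

If every affected URL really is on the HTTPS origin, the result would have a single key, `https://usc.gov.mm/`, which would point at the HTTP-to-HTTPS switch as something worth examining.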