Closed: Svekla closed this issue 7 years ago.
It looks like wpull has 403 in `NO_DOCUMENT_STATUS_CODES` (the entire list is `(401, 403, 404, 405, 410)`). In this case it calls `self._result_rule.handle_no_document` instead of `self._result_rule.handle_document_error`, and `handle_no_document` never retries the URL. For other error codes, wpull puts the URL at the end of the queue and retries it up to 3 times by default.
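The dispatch described above can be sketched as follows. This is a simplified illustration, not wpull's actual code: the tuple and the two handler names are from the discussion above, but the function wrapping them is an assumption for demonstration purposes.

```python
# Simplified sketch of how wpull routes an error response (the real logic
# lives in wpull/processor/web.py; this stand-in only mirrors the behavior
# described above).

NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)

def classify_response(status_code, no_document_codes=NO_DOCUMENT_STATUS_CODES):
    """Return which result-rule handler an error status code is routed to."""
    if status_code in no_document_codes:
        # Treated as a permanent "no document" result: the URL is never retried.
        return 'handle_no_document'
    # Other error codes go to handle_document_error, which re-queues the URL
    # (up to 3 retries by default).
    return 'handle_document_error'
```

With the default tuple, `classify_response(403)` lands in the no-retry path; removing 403 from the tuple sends it through the retry path instead.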
There is an open wpull issue about making the set of codes treated as a permanent error configurable: https://github.com/chfoo/wpull/issues/143
I think I can add something to customize `NO_DOCUMENT_STATUS_CODES` in grab-site. Meanwhile, if you want a fix right now, you can `grep -r` for `NO_DOCUMENT_STATUS_CODES` in the Python site-packages directory of the grab-site venv that wpull ended up in, and remove 403 from the list of codes. grab-site should then retry 403s (eventually, after everything else currently in the queue is handled).
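The effect of that hand edit can be illustrated with a small stand-in class. `WebProcessor` below is a mock of `wpull.processor.web.WebProcessor` (which holds the real tuple), so this snippet only demonstrates the change, it does not patch an installed wpull:

```python
# Stand-in for wpull.processor.web.WebProcessor, which defines the real tuple.
class WebProcessor:
    NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)

# Drop 403 so it is routed through the retry path instead of being treated
# as a permanent "no document" error.
WebProcessor.NO_DOCUMENT_STATUS_CODES = tuple(
    code for code in WebProcessor.NO_DOCUMENT_STATUS_CODES if code != 403
)

print(WebProcessor.NO_DOCUMENT_STATUS_CODES)  # (401, 404, 405, 410)
```

Editing the tuple literal in `site-packages/wpull/processor/web.py` itself, as described above, achieves the same thing persistently.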
Thank you! How can I modify the wait time after a retry, and the number of retries grab-site does? Will `--wpull-args="--tries 10\"--waitretry 60\""` work?
`--wpull-args="--tries 10 --waitretry 60"`, but I don't really know whether `waitretry` affects this scenario; you'd have to experiment.
Thank you again.
So I edited web.py (I'm on Ubuntu 16.10):

```
Binary file gs-venv/lib/python3.4/site-packages/wpull/processor/__pycache__/web.cpython-34.pyc matches
gs-venv/lib/python3.4/site-packages/wpull/processor/web.py: NO_DOCUMENT_STATUS_CODES = (401, 404, 405, 410,)
gs-venv/lib/python3.4/site-packages/wpull/processor/web.py: self._no_document_codes = WebProcessor.NO_DOCUMENT_STATUS_CODES
```
But wpull still saves the 403 Forbidden responses to the WARC. Is that normal? Will wpull replace these URLs after a retry?

EDIT: I'm trying the new version of grab-site. No 403s so far.
If the page can be retrieved, it should eventually create another WARC record with a valid page. I don't know whether WARC players generally handle this properly, because some might always land you on the first record. Hopefully they do; I would file bugs if you can't see the other record.
If you end up needing to strip out 403 responses: a while ago I used some WARC post-processing to strip out 404 responses: https://github.com/ArchiveTeam/greader-grab/blob/6f22c350cdd5260956f0da738790cd222d057edf/warc2warc_greader.py#L74 - this would have to be modified (s/404/403/), and some unrelated code might need to be removed.
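The core of that post-processing idea, adapted from 404 to 403, is just a keep/drop decision per WARC record. A minimal sketch of that decision follows; in practice you would drive it from a WARC library (e.g. warcio) that iterates the per-record gzipped members, and the record-type and status-line inputs here are assumptions about what such an iterator would hand you:

```python
def should_keep_record(warc_type, http_status_line):
    """Decide whether to copy a WARC record into the filtered output.

    warc_type        -- value of the record's WARC-Type header
    http_status_line -- first line of the record's HTTP payload
    """
    if warc_type != 'response':
        return True  # keep request records, metadata, warcinfo, etc.
    # An HTTP status line looks like "HTTP/1.1 403 Forbidden";
    # the status code is the second whitespace-separated field.
    parts = http_status_line.split(None, 2)
    return not (len(parts) >= 2 and parts[1] == '403')
```

Note that this keeps the matching `request` records even when the `response` is dropped; depending on how strict you want the output WARC to be, you may also want to drop requests whose response was filtered out.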
Maybe chfoo knows if there's a better way to convince wpull to not write WARC records for some responses.
(BTW, I have added a `--permanent-error-status-codes=` option, so hopefully you won't have to modify the wpull source code for this any more.)
It works, thank you:

```
2017-02-08 21:41:20,846 - wpull.processor.web - INFO - Fetching ‘http://XXX/viewtopic.php?f=10&t=12998’.
2017-02-08 21:41:21,056 - wpull.processor.web - INFO - Fetched ‘http://XXX/viewtopic.php?f=10&t=12998’: 403 Forbidden. Length: 289 [text/html; charset=iso-8859-1].
2017-02-10 04:07:50,364 - wpull.processor.web - INFO - Fetching ‘http://XXX/viewtopic.php?f=10&t=12998’.
2017-02-10 04:07:53,002 - wpull.processor.web - INFO - Fetched ‘http://XXX/viewtopic.php?f=10&t=12998’: 200 OK. Length: unspecified [text/html; charset=UTF-8].
```
Hi! I want to archive a forum that has anti-bot protection: every now and then it returns a 403 Forbidden error instead of the page for around 30 seconds, and after those ~30 seconds it displays the forum normally again. I've tried different delays and concurrency settings to avoid triggering it, but it still happens, so I have a few questions. Does grab-site try to download these 403 URLs again after some time, or does it leave them as 403? If it leaves them as 403, how can I archive the forum?
Thank you very much for help :)