ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Help on 403 Forbidden errors #97

Closed Svekla closed 7 years ago

Svekla commented 7 years ago

Hi! I want to archive a forum that has anti-bot protection: every now and then, while the site is being grabbed, it returns a 403 Forbidden error instead of the page for around 30 seconds, and after that it displays the forum normally again. I've tried different delays and concurrency settings to avoid triggering it, but it still happens, so I have a few questions. Does grab-site retry these 403 URLs again after some time, or does it leave them as 403? If it leaves them as 403, how can I archive the forum?

Thank you very much for help :)

ivan commented 7 years ago

It looks like wpull has 403 in NO_DOCUMENT_STATUS_CODES (the entire list is (401, 403, 404, 405, 410,)), and in this case it calls into self._result_rule.handle_no_document instead of self._result_rule.handle_document_error, and handle_no_document does not ever retry the URL. For other codes, it would put the URL at the end of the queue to retry up to 3 times by default.
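The dispatch described above can be summarized in a few lines. This is a simplified sketch, not wpull's actual code; the function name `handle_response` and the string return values are illustrative only, though the tuple matches the list quoted above:

```python
# Codes wpull treats as "no document exists": the URL is never retried.
NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)

def handle_response(status_code, retries_left=3):
    """Illustrative sketch of how wpull routes a response by status code."""
    if status_code == 200:
        return 'handle_document'        # saved; links extracted
    if status_code in NO_DOCUMENT_STATUS_CODES:
        return 'handle_no_document'     # treated as permanent; never retried
    if retries_left > 0:
        return 'requeue'                # back of the queue, up to 3 tries
    return 'handle_document_error'
```

Under this model, a temporary 403 from anti-bot protection is mistaken for a permanent failure.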

There is a wpull bug about adding something to customize which codes are treated as a permanent error: https://github.com/chfoo/wpull/issues/143

I think I can add something to customize NO_DOCUMENT_STATUS_CODES in grab-site. Meanwhile, if you want a fix right now, you can grep -r for NO_DOCUMENT_STATUS_CODES in the Python site-packages directory (or grab-site venv) that wpull ended up in, and remove 403 from the list of codes. Then grab-site should retry 403s (eventually, after everything else currently in the queue is handled).
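The manual patch amounts to one tuple edit. The sketch below performs it on a stand-in copy of the relevant line, since the real path (e.g. gs-venv/lib/python3.X/site-packages/wpull/processor/web.py) varies per install; the filename web_snippet.py is just a placeholder:

```shell
# Stand-in for the line in wpull/processor/web.py that defines the tuple.
printf '    NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410,)\n' > web_snippet.py

# Drop 403 so wpull no longer treats it as a permanent error.
sed -i 's/403, //' web_snippet.py

cat web_snippet.py
```

On a real install, run the sed against the web.py that `grep -r NO_DOCUMENT_STATUS_CODES` turns up, and delete the stale `__pycache__` .pyc alongside it so the edit takes effect.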

Svekla commented 7 years ago

Thank you! How can I modify the wait time between retries, and the number of retries grab-site does? Will --wpull-args="--tries 10 --waitretry 60" work?

ivan commented 7 years ago

--wpull-args="--tries 10 --waitretry 60" but I don't really know if waitretry affects this scenario; you'd have to experiment.
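For intuition, here is the retry model these flags suggest, assuming wget-style semantics where --waitretry does linear backoff (wait 1s, then 2s, up to the cap). Whether wpull applies this to requeued 403s is exactly the open question above; this is a hypothetical sketch, not wpull's implementation:

```python
import time

def fetch_with_retries(fetch, tries=10, waitretry=60):
    """Retry up to `tries` times, sleeping min(attempt, waitretry) seconds
    between attempts (wget-style linear backoff, assumed not verified).
    Returns the attempt number that succeeded, or None."""
    for attempt in range(1, tries + 1):
        if fetch() == 200:
            return attempt
        if attempt < tries:
            time.sleep(min(attempt, waitretry))
    return None
```

With a ~30-second ban window, a waitretry cap of 60 seconds would comfortably outlast the block by the second or third attempt, if wpull honors it here.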

Svekla commented 7 years ago

Thank you again.

Svekla commented 7 years ago

So I edited web.py (I'm on Ubuntu 16.10):

Binary file gs-venv/lib/python3.4/site-packages/wpull/processor/__pycache__/web.cpython-34.pyc matches
gs-venv/lib/python3.4/site-packages/wpull/processor/web.py:    NO_DOCUMENT_STATUS_CODES = (401, 404, 405, 410,)
gs-venv/lib/python3.4/site-packages/wpull/processor/web.py:        self._no_document_codes = WebProcessor.NO_DOCUMENT_STATUS_CODES

But wpull still saves the 403 Forbidden responses to the WARC: [screenshot of the 403 records]

Is it normal? Will wpull replace these URLs after retry?

EDIT: I'm trying on new version of grab-site. No 403 so far.

ivan commented 7 years ago

If the page can be retrieved, it should eventually create another WARC record with a valid page. I don't know whether WARC players generally handle this properly, because some might always land you on the first record. Hopefully they work, though; I would file bugs if you can't see the other record.

If you end up needing to strip out 403 responses, a while ago I used some WARC post-processing to strip out 404 responses: https://github.com/ArchiveTeam/greader-grab/blob/6f22c350cdd5260956f0da738790cd222d057edf/warc2warc_greader.py#L74 - this would have to be modified (s/404/403/) and some unrelated code might need to be removed.
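The core of that post-processing is just a status filter. In miniature, with records modeled as (uri, status) pairs purely for illustration (a real pass would iterate actual WARC response records with a WARC library, as the linked script does):

```python
def strip_status(records, status=403):
    """Keep every record except responses with the given HTTP status.
    Records are modeled here as (uri, status_code) pairs for illustration."""
    return [(uri, code) for uri, code in records if code != status]
```

Run over a crawl where a URL was fetched twice, this keeps only the good fetch, so a replay tool has just one record to land on.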

Maybe chfoo knows if there's a better way to convince wpull to not write WARC records for some responses.

(BTW, I have added a --permanent-error-status-codes=, so hopefully you won't have to modify the wpull source code for that any more.)

Svekla commented 7 years ago

It works, thank you:

2017-02-08 21:41:20,846 - wpull.processor.web - INFO - Fetching ‘http://XXX/viewtopic.php?f=10&t=12998’.
2017-02-08 21:41:21,056 - wpull.processor.web - INFO - Fetched ‘http://XXX/viewtopic.php?f=10&t=12998’: 403 Forbidden. Length: 289 [text/html; charset=iso-8859-1].
2017-02-10 04:07:50,364 - wpull.processor.web - INFO - Fetching ‘http://XXX/viewtopic.php?f=10&t=12998’.
2017-02-10 04:07:53,002 - wpull.processor.web - INFO - Fetched ‘http://XXX/viewtopic.php?f=10&t=12998’: 200 OK. Length: unspecified [text/html; charset=UTF-8].