Refactor handling of `ignored_http_status_codes` and `SessionError`

apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

https://crawlee.dev/python/

Apache License 2.0

Refactor handling of `ignored_http_status_codes` and `SessionError` #708

Open janbuchar opened 1 week ago

janbuchar commented 1 week ago

This is technical debt introduced in https://github.com/apify/crawlee-python/pull/167.
The fix should probably be ported over to JS afterwards (v4?).

The reasoning behind the change was that some status codes (such as 401) are automatically treated as a `SessionError`, and HTTP-based crawlers don't take `ignore_http_error_status_codes` into account for them. While uncommon, it should be possible to explicitly ignore any status code.
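To make the intended precedence concrete, here is a minimal standalone sketch, not crawlee code. The names `classify_status` and `Outcome`, the default blocked set `{401, 403, 429}`, and the "anything >= 400 is an HTTP error" rule are all assumptions for illustration only; the issue itself names just 401.

```python
from enum import Enum

# Assumed defaults for illustration; the issue only mentions 401 explicitly.
ASSUMED_SESSION_BLOCKED_CODES = frozenset({401, 403, 429})


class Outcome(Enum):
    OK = "ok"
    SESSION_ERROR = "session_error"
    HTTP_ERROR = "http_error"


def classify_status(
    status_code: int,
    ignored_codes: frozenset = frozenset(),
    session_blocked_codes: frozenset = ASSUMED_SESSION_BLOCKED_CODES,
) -> Outcome:
    """Hypothetical helper: an explicitly ignored code always wins."""
    # 1. Explicit user configuration takes precedence over everything else.
    if status_code in ignored_codes:
        return Outcome.OK
    # 2. Only then is the code checked against the session-blocking set.
    if status_code in session_blocked_codes:
        return Outcome.SESSION_ERROR
    # 3. Simplified assumption: any remaining 4xx/5xx is a plain HTTP error.
    if status_code >= 400:
        return Outcome.HTTP_ERROR
    return Outcome.OK


print(classify_status(401))
print(classify_status(401, ignored_codes=frozenset({401})))
```

With this ordering, a 401 is a session error by default but is let through once the user lists it as ignored.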

vdusek commented 6 days ago

Rather than

if (
    context.session
    and status_code not in self._http_client._ignore_http_error_status_codes  # noqa: SLF001
    and context.session.is_blocked_status_code(status_code=status_code)
):

use something like

if context.session and context.session.is_blocked_status_code(
    status_code=status_code,
    additional_blocked_status_codes=self._http_client.additional_blocked_status_codes,
    ignore_http_error_status_codes=self._http_client.ignore_http_error_status_codes,
):

or come up with something better
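One possible shape for the second variant, as a rough sketch only: `SessionSketch` is a hypothetical stand-in, not the real session class, and the two extra keyword arguments are the ones proposed above, not an existing API. The default blocked codes are likewise an assumption.

```python
from __future__ import annotations

from collections.abc import Iterable


class SessionSketch:
    """Hypothetical stand-in for the session object; not the real crawlee class."""

    def __init__(self, blocked_status_codes: Iterable[int] = (401, 403, 429)) -> None:
        # Assumed default set of codes that mark a session as blocked.
        self._blocked_status_codes = set(blocked_status_codes)

    def is_blocked_status_code(
        self,
        *,
        status_code: int,
        additional_blocked_status_codes: Iterable[int] | None = None,
        ignore_http_error_status_codes: Iterable[int] | None = None,
    ) -> bool:
        # Explicitly ignored codes are never treated as blocking.
        if status_code in set(ignore_http_error_status_codes or ()):
            return False
        # Otherwise check the session defaults plus any client-level additions.
        blocked = self._blocked_status_codes | set(additional_blocked_status_codes or ())
        return status_code in blocked


session = SessionSketch()
print(session.is_blocked_status_code(status_code=401))
print(session.is_blocked_status_code(status_code=401, ignore_http_error_status_codes=[401]))
```

This keeps the precedence decision inside the session, so the crawler no longer has to reach into private attributes of the HTTP client (the `# noqa: SLF001` in the first snippet).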