apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0
117 stars, 11 forks

fix: possible infinite loop in Apify-Scrapy proxy middleware #259

Closed: vdusek closed this 1 month ago

vdusek commented 1 month ago

In the template log, scraping proceeds normally, as it should:

...
2024-09-02T21:34:06.9322150Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/run-scrapy-in-cloud>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:06.9455173Z [title_spider] [INFO] TitleSpider is parsing <200 https://docs.apify.com/academy/web-scraping-for-beginners>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:06.9530763Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/success-stories>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
2024-09-02T21:34:07.0125374Z [title_spider] [INFO] TitleSpider is parsing <200 https://apify.com/templates/ts-crawlee-playwright-chrome>... ({"spider": "<TitleSpider 'title_spider' at 0x7f70c7ab07f0>"})
...

But when processing https://console.apify.com/robots.txt, it throws an exception in the proxy middleware, which is caught and logged:

2024-09-02T21:34:07.0410494Z [apify] [WARN] ApifyHttpProxyMiddleware: TunnelError occurred for request="<GET https://console.apify.com/robots.txt>", reason="Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]", skipping...

But then the middleware incorrectly returns the request object here:

if isinstance(exception, TunnelError):
    Actor.log.warning(
        f'ApifyHttpProxyMiddleware: TunnelError occurred for request="{request}", '
        'reason="{exception}", skipping...'
    )
    return request

Returning the request causes it to be rescheduled, so the same failure repeats and we're stuck in a loop.
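To illustrate why returning the request loops, here is a minimal, self-contained sketch. It does not use Scrapy itself: the `TunnelError` class, the two handler functions, and the toy `run` scheduler are hypothetical stand-ins written only for this example. It relies on the documented Scrapy behavior that a `Request` returned from `process_exception` is rescheduled, while returning `None` is not. Returning `None` in the second handler is an assumed fix for illustration, not necessarily the exact change made in this PR.

```python
from collections import deque
from typing import Callable, Optional


class TunnelError(Exception):
    """Hypothetical stand-in for Scrapy's TunnelError, for this sketch only."""


def handler_buggy(request: str, exception: Exception) -> Optional[str]:
    # Mirrors the snippet above: handing the request back asks for a retry.
    if isinstance(exception, TunnelError):
        return request
    return None


def handler_skipping(request: str, exception: Exception) -> Optional[str]:
    # Assumed fix: return None so the failed request is not queued again.
    return None


def run(handler: Callable[[str, Exception], Optional[str]], max_attempts: int = 10) -> int:
    """Toy scheduler in which every download fails with TunnelError.

    Returns the number of download attempts made before the queue drained
    or the safety cap `max_attempts` was reached.
    """
    queue = deque(['https://console.apify.com/robots.txt'])
    attempts = 0
    while queue and attempts < max_attempts:
        request = queue.popleft()
        attempts += 1
        retry = handler(request, TunnelError('403 Forbidden'))
        if retry is not None:
            queue.append(retry)  # a returned request is rescheduled
    return attempts


buggy_attempts = run(handler_buggy)
fixed_attempts = run(handler_skipping)
print(buggy_attempts, fixed_attempts)
```

With the buggy handler, the toy scheduler only stops because of the `max_attempts` cap. With the skipping handler, the request is attempted once and then dropped.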

Also see https://github.com/apify/actor-templates/pull/288, where the tests are executed with an alpha release from this branch.