apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0

Scrapy integration silently throws away redirects #185

Closed · honzajavorek closed this issue 6 months ago

honzajavorek commented 6 months ago

I'm debugging a situation where the same Scrapy spider produces 720 items locally, but 370 through Apify. After a day of detective work I figured out that the Scrapy-Apify integration probably doesn't handle redirects properly. Take this example:

```python
from scrapy import Spider as BaseSpider

class Spider(BaseSpider):
    name = "minimal-example"

    start_urls = [
        "https://httpbin.org/redirect-to?url=https%3A%2F%2Fhonzajavorek.cz%2F",
    ]

    def parse(self, response):
        raise NotImplementedError("Not implemented yet")
```

Running the code through Scrapy's crawl command gives me:

Running the code through Apify integration gives me:

It looks like the Apify integration silently throws away redirects!
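One way to make the difference observable, using only standard Scrapy APIs, would be a variant of the spider above that logs where it ends up instead of raising (the spider name here is illustrative):

```python
from scrapy import Spider as BaseSpider


class DebugSpider(BaseSpider):
    name = "minimal-example-debug"  # illustrative name

    start_urls = [
        "https://httpbin.org/redirect-to?url=https%3A%2F%2Fhonzajavorek.cz%2F",
    ]

    def parse(self, response):
        # If the redirect was followed, response.url is the redirect target and
        # RedirectMiddleware records the original URL in meta["redirect_urls"].
        self.logger.info(
            "Reached %s (redirect chain: %s)",
            response.url,
            response.meta.get("redirect_urls"),
        )
```

Locally this should log the honzajavorek.cz URL; if the integration drops the redirected request, the line never appears.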

honzajavorek commented 6 months ago

I suspect it could be something in the integration or in how redirects are handled in general, but perhaps it also has to do with the fact that the initial request points to a different domain than the target URL of the redirect.

Not sure whether this is something the Apify platform handles differently from redirects within a single domain. In my minimal example it's httpbin.org redirecting to honzajavorek.cz, but in my production code it's similar, just with subdomains: www.example.com redirecting to foo.example.com. Even if that were the cause, though, the redirected requests shouldn't be dropped silently.

vdusek commented 6 months ago

Hi @honzajavorek, thank you for reporting the problem. I suppose the cause of this behavior could be that Scrapy's redirect middleware does not communicate correctly with our Scheduler/Request Queue. It's possible that we would have to modify Scrapy's redirect middleware in a similar way to how we modified the ProxyMiddleware. I'll look into it this week.
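For reference, one way to check that hypothesis locally would be to swap in an instrumented subclass of Scrapy's stock RedirectMiddleware that logs every request the middleware produces. This is just a debugging sketch (class and module names are made up), not the change that eventually landed in the SDK:

```python
import logging

from scrapy import Request
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

logger = logging.getLogger(__name__)


class LoggingRedirectMiddleware(RedirectMiddleware):
    """Logs redirects so it is visible whether they ever reach the scheduler."""

    def process_response(self, request, response, spider):
        result = super().process_response(request, response, spider)
        if isinstance(result, Request):
            # The returned Request is handed back to the engine, which should
            # re-enqueue it via the scheduler (and thus the request queue).
            logger.info("Redirected %s -> %s", request.url, result.url)
        return result
```

Enabling it means replacing the stock middleware in DOWNLOADER_MIDDLEWARES at its default priority (600) with this subclass.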

honzajavorek commented 6 months ago

Thanks for looking into this! I'm happy to help or to test any changes against my production code.


vdusek commented 6 months ago

Update after today's investigation.

The cause of the problem doesn't seem to be in RedirectMiddleware itself but in the handling of downloader middlewares in general. Probably any DownloaderMiddleware could run into this kind of issue.

It's possible that if any DownloaderMiddleware returns a Request (from process_response, but maybe from process_request as well), there is an issue. From the Scrapy documentation:

process_response() - If it returns a Request object, the middleware chain is halted, and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

process_request() - If it returns a Request object, Scrapy will stop calling process_request() methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

When the request is rescheduled this way, it seems to bypass the scheduler somehow, or at least it never appears in the request queue. That's why the redirected request is not processed in this case.
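To make that concrete, here is a minimal, self-contained downloader middleware (purely illustrative, not part of the SDK) whose process_response returns a Request. Any middleware doing this relies on the engine re-enqueuing the returned request through the scheduler, which is exactly the path where the redirected requests were getting lost:

```python
class RetryOnceMiddleware:
    """Illustrative downloader middleware: returns a Request from
    process_response, triggering the same rescheduling path that
    RedirectMiddleware uses for 3xx responses."""

    def process_response(self, request, response, spider):
        if not request.meta.get("retried_once"):
            new_request = request.replace(dont_filter=True)
            new_request.meta["retried_once"] = True
            # Returning a Request instead of the Response halts the middleware
            # chain; Scrapy must schedule this request to be downloaded again.
            return new_request
        return response
```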

honzajavorek commented 6 months ago

I'm trying the pre-releases. It looks like the redirects work 🎉 in most cases. But I'm getting a lot of TypeError: can't pickle Selector objects errors. Full stack trace below:

Stack Trace

```
Unhandled error in Deferred:
[twisted] CRITICAL Unhandled error in Deferred:
Traceback (most recent call last):
  File "/Users/honza/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/twisted/internet/asyncioreactor.py", line 271, in _onTimer
    self.runUntilCurrent()
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/twisted/internet/task.py", line 680, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/scrapy/utils/defer.py", line 102, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/scrapy/core/scraper.py", line 298, in _process_spidermw_output
    self.crawler.engine.crawl(request=output)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/scrapy/core/engine.py", line 290, in crawl
    self._schedule_request(request, self.spider)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/scrapy/core/engine.py", line 297, in _schedule_request
    if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/apify/scrapy/scheduler.py", line 87, in enqueue_request
    apify_request = to_apify_request(request, spider=self.spider)
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/apify/scrapy/requests.py", line 75, in to_apify_request
    scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
  File "/Users/honza/Library/Caches/pypoetry/virtualenvs/juniorguru-plucker-re1lP3cl-py3.11/lib/python3.11/site-packages/parsel/selector.py", line 532, in __getstate__
    raise TypeError("can't pickle Selector objects")
builtins.TypeError: can't pickle Selector objects
```
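For anyone hitting the same TypeError: judging from the traceback (a guess, not a confirmed diagnosis), to_apify_request pickles the whole Scrapy request, so any unpicklable object attached to it, such as a parsel Selector passed through cb_kwargs or meta, fails at that point. A spider-side workaround is to extract plain, picklable values before yielding the follow-up request; the spider below is purely illustrative:

```python
from scrapy import Spider


class ListingSpider(Spider):
    name = "listing-example"  # illustrative
    start_urls = ["https://example.com/listing"]  # illustrative

    def parse(self, response):
        for row in response.css("table tr"):
            href = row.css("a::attr(href)").get()
            if not href:
                continue
            # Don't put `row` (a parsel Selector) into cb_kwargs or meta;
            # pass plain strings instead so the request stays picklable.
            yield response.follow(
                href,
                callback=self.parse_detail,
                cb_kwargs={"title": row.css("a::text").get()},
            )

    def parse_detail(self, response, title):
        yield {"title": title, "url": response.url}
```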
vdusek commented 6 months ago

Summary: we discussed it with Honza on Discord. The TypeError: can't pickle Selector objects is not related to the current changes. Thanks to the redirect middleware now working, we've been able to move further and have uncovered another problem. I'll try to address it next week.

vdusek commented 6 months ago

Closing, since it was resolved in https://github.com/apify/apify-sdk-python/pull/186 & https://github.com/apify/actor-templates/pull/272.