apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0

Twisted/CRITICAL: builtins.TypeError: can't pickle Selector objects (Scrapy) #189

Closed honzajavorek closed 6 months ago

honzajavorek commented 6 months ago

My spider https://github.com/juniorguru/plucker/blob/26d1758e310b8b2451541516cf4447e4a5e4a11a/juniorguru_plucker/jobs_jobscz/spider.py runs just fine with Scrapy, but fails with critical errors when teaming up with Apify.

See exception details 💌

```
[twisted] CRITICAL Unhandled error in Deferred:
Traceback (most recent call last):
  File "/Users/honza/.local/share/mise/installs/python/3.11/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/asyncioreactor.py", line 271, in _onTimer
    self.runUntilCurrent()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 680, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/utils/defer.py", line 102, in
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/scraper.py", line 298, in _process_spidermw_output
    self.crawler.engine.crawl(request=output)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 290, in crawl
    self._schedule_request(request, self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 297, in _schedule_request
    if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/scheduler.py", line 87, in enqueue_request
    apify_request = to_apify_request(request, spider=self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/requests.py", line 76, in to_apify_request
    scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/parsel/selector.py", line 532, in __getstate__
    raise TypeError("can't pickle Selector objects")
builtins.TypeError: can't pickle Selector objects
```

When debugging the problem, I figured out the following line causes the problem:

```python
scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
```

Inspecting problematic dicts, the culprit seems to be the fact that I pass a response object around:

```python
yield response.follow(
    script_url,
    callback=self.parse_job_widget_script,
    cb_kwargs=dict(item=item, html_response=response, track_id=track_id),
)
```

Then the response comes in the dict like this:

```
{'body': b'',
 'callback': 'parse_job_widget_script',
 'cb_kwargs': {'html_response': <200 https://example.com/.../>,
               'item': {...}}}
```

The `<200 https://example.com/.../>` is a representation of the Response, which probably cannot be pickled, or at least the Selector objects inside it cannot.
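To see why the dump fails, here is a minimal, self-contained reproduction (not from the original report): parsel's `Selector.__getstate__` raises `TypeError`, so anything reachable from the request dict that holds a Selector poisons the whole `pickle.dumps` call. The `UnpicklableSelector` class below is a stand-in mimicking that behavior, not parsel's actual class:

```python
import pickle


class UnpicklableSelector:
    """Stand-in mimicking parsel.Selector, whose __getstate__ refuses pickling."""

    def __getstate__(self):
        raise TypeError("can't pickle Selector objects")


# Any dict containing such an object fails to pickle as a whole,
# which is exactly what happens to the serialized Scrapy request.
try:
    pickle.dumps({'cb_kwargs': {'html_response': UnpicklableSelector()}})
except TypeError as exc:
    print(exc)  # → can't pickle Selector objects
```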

I don't think you can do much about it; it's probably an inherent limitation of delegating the request mechanics to an external system such as Apify. If you need to serialize and later deserialize the request, there's simply no way to pass around something which Python cannot pickle.

So I think the only solution here is to fail nicely. The line which pickles the request should catch the exception and raise a nicer error which explains what is happening and why, ideally with some guidance on how to avoid the problem. I'll get back here if I come up with a workaround.
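The suggested guard could look roughly like this (a hedged sketch only; `encode_scrapy_request` is an illustrative name, not the SDK's actual `to_apify_request` implementation):

```python
import codecs
import pickle


def encode_scrapy_request(scrapy_request_dict: dict) -> str:
    """Pickle and base64-encode a Scrapy request dict, failing with a clear message."""
    try:
        return codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
    except (TypeError, pickle.PicklingError) as exc:
        raise ValueError(
            'Failed to pickle the Scrapy request before handing it to Apify. '
            'Requests are serialized into the Apify request queue, so everything '
            'in cb_kwargs and meta must be picklable. Avoid passing Response or '
            'Selector objects; pass plain data (e.g. response.url) instead.'
        ) from exc
```

The point is that the user then sees an actionable message instead of a raw Twisted traceback ending deep inside parsel.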

honzajavorek commented 6 months ago

I figured out I don't need the whole response, so I was able to fix this with a change like this: https://github.com/juniorguru/plucker/commit/a0cabe8f14a4f9959051c18fa09297b15bbb9d27
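The general pattern behind such a fix (a sketch with illustrative field names, not the actual commit contents): keep only plain, picklable values in `cb_kwargs`, so the request dict survives the pickle round trip that `to_apify_request` performs:

```python
import pickle

# Illustrative request dict: plain data in cb_kwargs instead of the Response.
request_dict = {
    'callback': 'parse_job_widget_script',
    'cb_kwargs': {
        'item': {'title': 'Example job'},
        'html_url': 'https://example.com/job/123',  # e.g. response.url, not response
        'track_id': 'abc123',
    },
}

# The round trip through pickle now succeeds.
restored = pickle.loads(pickle.dumps(request_dict))
assert restored == request_dict
```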

vdusek commented 6 months ago

Thank you @honzajavorek for reporting this. I've opened a PR https://github.com/apify/apify-sdk-python/pull/191 which should improve the error handling in to_apify_request. Also, the ApifyScheduler should let the user know that the request was not scheduled for this reason.