apify / apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
https://docs.apify.com/sdk/python
Apache License 2.0

Twisted/CRITICAL: builtins.TypeError: can't pickle Selector objects (Scrapy) #189

Closed honzajavorek closed 6 months ago

honzajavorek commented 6 months ago

My spider https://github.com/juniorguru/plucker/blob/26d1758e310b8b2451541516cf4447e4a5e4a11a/juniorguru_plucker/jobs_jobscz/spider.py runs just fine with Scrapy, but fails with critical errors when teaming up with Apify.

See exception details 💌

```
[twisted] CRITICAL Unhandled error in Deferred:
Traceback (most recent call last):
  File "/Users/honza/.local/share/mise/installs/python/3.11/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/asyncioreactor.py", line 271, in _onTimer
    self.runUntilCurrent()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 680, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/twisted/internet/task.py", line 526, in _oneWorkUnit
    result = next(self._iterator)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/utils/defer.py", line 102, in
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/scraper.py", line 298, in _process_spidermw_output
    self.crawler.engine.crawl(request=output)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 290, in crawl
    self._schedule_request(request, self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 297, in _schedule_request
    if not self.slot.scheduler.enqueue_request(request):  # type: ignore[union-attr]
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/scheduler.py", line 87, in enqueue_request
    apify_request = to_apify_request(request, spider=self.spider)
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/apify/scrapy/requests.py", line 76, in to_apify_request
    scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
  File "/Users/honza/Projects/juniorguru-plucker/.venv/lib/python3.11/site-packages/parsel/selector.py", line 532, in __getstate__
    raise TypeError("can't pickle Selector objects")
builtins.TypeError: can't pickle Selector objects
```

When debugging the problem, I figured out the following line causes the problem:

```python
scrapy_request_dict_encoded = codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
```

Inspecting problematic dicts, the culprit seems to be the fact that I pass a response object around:

```python
yield response.follow(
    script_url,
    callback=self.parse_job_widget_script,
    cb_kwargs=dict(item=item, html_response=response, track_id=track_id),
)
```

Then the response comes in the dict like this:

```
{'body': b'',
 'callback': 'parse_job_widget_script',
 'cb_kwargs': {'html_response': <200 https://example.com/.../>,
               'item': {...}}}
```

The `<200 https://example.com/.../>` is a representation of the Response, which probably cannot be pickled, or at least the Selector objects inside it cannot.
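To see why the dump fails, here is a minimal, self-contained reproduction (not from the original report): parsel's `Selector.__getstate__` raises `TypeError`, so anything reachable from the request dict that holds a Selector poisons the whole `pickle.dumps` call. The `UnpicklableSelector` class below is a stand-in mimicking that behavior, not parsel's actual class:

```python
import pickle


class UnpicklableSelector:
    """Stand-in mimicking parsel.Selector, whose __getstate__ refuses pickling."""

    def __getstate__(self):
        raise TypeError("can't pickle Selector objects")


# Any dict containing such an object fails to pickle as a whole,
# which is exactly what happens to the serialized Scrapy request.
try:
    pickle.dumps({'cb_kwargs': {'html_response': UnpicklableSelector()}})
except TypeError as exc:
    print(exc)  # → can't pickle Selector objects
```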

I don't think you can do much about it; it's probably an inherent limitation of delegating the request mechanics to an external system such as Apify. If you need to serialize and later deserialize the request, there's simply no way to pass around something which Python cannot pickle.

So I think the only solution here is to fail nicely. The line which pickles the request should catch the exception and raise a nicer error which explains what is happening and why, ideally with some guidance on how to avoid the problem. I'll get back here if I come up with a workaround.
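The suggested guard could look roughly like this (a hedged sketch only; `encode_scrapy_request` is an illustrative name, not the SDK's actual `to_apify_request` implementation):

```python
import codecs
import pickle


def encode_scrapy_request(scrapy_request_dict: dict) -> str:
    """Pickle and base64-encode a Scrapy request dict, failing with a clear message."""
    try:
        return codecs.encode(pickle.dumps(scrapy_request_dict), 'base64').decode()
    except (TypeError, pickle.PicklingError) as exc:
        raise ValueError(
            'Failed to pickle the Scrapy request before handing it to Apify. '
            'Requests are serialized into the Apify request queue, so everything '
            'in cb_kwargs and meta must be picklable. Avoid passing Response or '
            'Selector objects; pass plain data (e.g. response.url) instead.'
        ) from exc
```

The point is that the user then sees an actionable message instead of a raw Twisted traceback ending deep inside parsel.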

honzajavorek commented 6 months ago

I figured out I don't need the whole response, so I was able to fix this with a change like this: https://github.com/juniorguru/plucker/commit/a0cabe8f14a4f9959051c18fa09297b15bbb9d27
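The general pattern behind such a fix (a sketch with illustrative field names, not the actual commit contents): keep only plain, picklable values in `cb_kwargs`, so the request dict survives the pickle round trip that `to_apify_request` performs:

```python
import pickle

# Illustrative request dict: plain data in cb_kwargs instead of the Response.
request_dict = {
    'callback': 'parse_job_widget_script',
    'cb_kwargs': {
        'item': {'title': 'Example job'},
        'html_url': 'https://example.com/job/123',  # e.g. response.url, not response
        'track_id': 'abc123',
    },
}

# The round trip through pickle now succeeds.
restored = pickle.loads(pickle.dumps(request_dict))
assert restored == request_dict
```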

vdusek commented 6 months ago

Thank you @honzajavorek for reporting this. I've opened a PR https://github.com/apify/apify-sdk-python/pull/191 which should improve the error handling in to_apify_request. Also, the ApifyScheduler should let the user know that the request was not scheduled for this reason.