apify / actor-templates

This project is the :house: home of Apify actor template projects to help users quickly get started.
https://apify.com/

Python Scrapy Actor uses Request Queue #184

Closed vdusek closed 1 year ago

vdusek commented 1 year ago

Description

Additional explanation of certain parts of code

Exceptions in run_until_complete

If we use run_until_complete in a custom nested event loop like this:

try:
    event_loop.run_until_complete(foo_coroutine())
except BaseException:
    traceback.print_exc()

We must wrap the call in a try block and call traceback.print_exc() in the except branch if we want the exception to show up in the log. Note that this only prints the traceback; it does not re-raise the exception. Without it, the processing of the request is terminated without any notice in the log, and Scrapy simply continues with the next request, which can be pretty confusing.
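For illustration, here is a minimal self-contained sketch of the pattern above. It is not the template's actual code: the failing foo_coroutine body, the run_in_custom_loop helper, and the use of asyncio.new_event_loop() are assumptions made only so that the snippet runs on its own.

```python
import asyncio
import traceback


async def foo_coroutine() -> None:
    # Hypothetical coroutine standing in for the real request processing.
    raise ValueError("something went wrong while processing the request")


def run_in_custom_loop() -> bool:
    """Run the coroutine in a dedicated event loop; return True on success."""
    # Assumption: a fresh loop is created here; the template may obtain it differently.
    event_loop = asyncio.new_event_loop()
    try:
        event_loop.run_until_complete(foo_coroutine())
        return True
    except BaseException:
        # Print the full traceback so the failure is visible in the log.
        traceback.print_exc()
        return False
    finally:
        event_loop.close()


if __name__ == "__main__":
    print(f"succeeded: {run_in_custom_loop()}")
```

Returning a boolean is just one way to let the caller react to the failure; the point is that the traceback is printed before control returns.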

"Robots" requests

"Robots" requests (*/robots.txt) are passed directly from the Engine (through the middlewares) to the Spider. They do not go through the Scheduler, and it would be pretty hard to force them through the Request Queue. So in our Retry Middleware, we identify these requests and skip any interaction with the Request Queue for them.
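One possible way to recognize such requests is sketched below, using only the standard library. The helper names is_robots_request and should_use_request_queue are hypothetical and are not taken from the template; the check simply mirrors the */robots.txt pattern mentioned above.

```python
from urllib.parse import urlparse


def is_robots_request(url: str) -> bool:
    """Return True if the URL path matches */robots.txt (hypothetical helper)."""
    return urlparse(url).path.endswith("/robots.txt")


def should_use_request_queue(url: str) -> bool:
    """Robots requests never went through the Scheduler, so skip the queue for them."""
    return not is_robots_request(url)


if __name__ == "__main__":
    for url in ("https://example.com/robots.txt", "https://example.com/page"):
        print(url, should_use_request_queue(url))
```

Matching on the parsed path rather than on the raw URL keeps query strings from affecting the check.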

Ticket

Blocked by

vdusek commented 1 year ago

just one suggestion and few questions (haven't tried it locally)

In case you would like to try it locally, it should be pretty easy (execution with the default Actor input):

cd templates/python-scrapy/
virtualenv --python $(which python3.11) .venv
source .venv/bin/activate
pip install -r requirements.txt
apify run --purge