Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Implement a clear, controlled way to stop the crawler from within the user function. It should properly shut down all resources and immediately stop the crawler from sending any further requests. It should mirror the JS version.
Use case:
The user wants to stop the crawler from within the user function.
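To make the requested behavior concrete, here is a minimal sketch using only asyncio. `MiniCrawler`, its `stop()` method, and the handler signature are all hypothetical and written just for illustration; this is not the Crawlee API, only one possible shape for it.

```python
import asyncio
from collections import deque


class MiniCrawler:
    """Hypothetical toy crawler for illustration; NOT the Crawlee API."""

    def __init__(self, handler):
        self._handler = handler
        self._queue = deque()
        self._stop_requested = False
        self.sent = []  # URLs for which a "request" was sent

    def add_requests(self, urls):
        self._queue.extend(urls)

    def stop(self):
        # Public, documented way to ask the crawler to stop.
        self._stop_requested = True

    async def run(self):
        # The flag is checked before every request, so nothing is sent
        # once the user function has called stop().
        while self._queue and not self._stop_requested:
            url = self._queue.popleft()
            self.sent.append(url)  # stands in for the real HTTP request
            await self._handler(self, url)
        # A real implementation would release its resources here.
        return self.sent


async def handler(crawler, url):
    # User function: decide to stop after seeing a particular page.
    if url.endswith("/2"):
        crawler.stop()


crawler = MiniCrawler(handler)
crawler.add_requests([f"https://example.com/{i}" for i in range(5)])
sent = asyncio.run(crawler.run())
print(sent)
```

In this sketch the two remaining URLs stay in the queue and are never requested, which is the behavior the issue asks for.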
Examples of current workarounds:
Add a flag check at the beginning of the user function to short-circuit its evaluation:
if finished:
    return
...
Drawback: Requests that are already queued are still sent, just not processed.
Call some private internals:
await crawler._pool.abort()
Drawback: Relies on private internals. Remaining tasks will still finish.
Drop the request provider:
await request_provider.drop()
Drawback: A bunch of errors, as existing tasks might still try to access request_provider.
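The drawback of the first workaround (the flag) can be reproduced with a small stand-alone simulation. Everything here is an assumption made for illustration: `run_with_flag` and the `sent`/`processed` lists are hypothetical stand-ins for the crawler loop and do not use Crawlee itself.

```python
import asyncio


async def run_with_flag(urls, stop_after):
    """Simulate a crawl loop whose handler short-circuits once a flag is set."""
    sent = []        # every URL that was "requested"
    processed = []   # URLs whose handler body actually ran
    finished = False

    async def handler(url):
        nonlocal finished
        if finished:
            return  # the workaround: skip the work, but the request already went out
        processed.append(url)
        if len(processed) >= stop_after:
            finished = True

    for url in urls:
        sent.append(url)  # stands in for the real HTTP request
        await handler(url)
    return sent, processed


urls = [f"https://example.com/{i}" for i in range(5)]
sent, processed = asyncio.run(run_with_flag(urls, stop_after=2))
print(f"sent={len(sent)} processed={len(processed)}")
```

All five requests are still sent even though only two are processed, which is why the flag alone does not satisfy the "immediately stop sending requests" requirement.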
Example of how this is solved in Scrapy: https://docs.scrapy.org/en/2.11/faq.html#how-can-i-instruct-a-spider-to-stop-itself