apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
4.22k stars, 295 forks

Implement crawler.teardown (exists in JS version) #651

Open Pijukatel opened 19 hours ago

Pijukatel commented 19 hours ago

Implement a way to stop the crawler in an obvious and controlled manner from within the user function. It should properly shut down all resources and immediately stop the crawler from sending any further requests. It should mirror the JS version.

Use case: The user wants to stop the crawler from within the user function.

Examples of current workarounds:

  1. Add a flag check at the beginning of the user function to short-circuit its evaluation: `if finished: return ...` Drawback: Requests that are already queued are still being sent, but their responses are not processed.

  2. Call a private internal: `await crawler._pool.abort()` Drawback: This relies on a private API, and the remaining tasks will still finish.

  3. Drop the request provider: `await request_provider.drop()` Drawback: This produces a bunch of errors, because existing tasks might still try to access the request provider.
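To make the drawback of workaround 1 concrete, here is a minimal sketch that uses only `asyncio`. It is not Crawlee's API: `fetch`, `handler`, `crawl`, and the `fetched`/`processed` lists are hypothetical stand-ins for a crawler loop and a user function. It only illustrates that a flag inside the user function cannot stop requests from being sent.

```python
import asyncio

# Hypothetical stand-ins, NOT Crawlee's API: just enough to show the flag workaround.
finished = False
fetched: list[str] = []    # URLs for which a request was sent
processed: list[str] = []  # URLs whose handler body actually ran


async def fetch(url: str) -> str:
    """Pretend to send a request and record that it happened."""
    fetched.append(url)
    await asyncio.sleep(0)  # yield control, as a real HTTP call would
    return f"<html>{url}</html>"


async def handler(url: str, body: str) -> None:
    """User function guarded by the flag (workaround 1)."""
    global finished
    if finished:
        return  # short-circuit: the response is fetched but discarded
    processed.append(url)
    if url.endswith("/2"):
        finished = True  # the user decides the crawl is done


async def crawl(urls: list[str]) -> None:
    # The toy crawler knows nothing about the flag, so it keeps sending requests.
    for url in urls:
        body = await fetch(url)
        await handler(url, body)


urls = [f"https://example.com/{i}" for i in range(1, 6)]
asyncio.run(crawl(urls))
print(len(fetched), len(processed))
```

In this toy run every URL is still requested even though the handler stops doing work after the second one, which is the wasted traffic the issue describes.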

Example of how this is solved in Scrapy: https://docs.scrapy.org/en/2.11/faq.html#how-can-i-instruct-a-spider-to-stop-itself
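For comparison, Scrapy lets a callback stop the spider by raising its `CloseSpider` exception. A rough sketch of the same idea in plain `asyncio` might look like the following. The `StopCrawl` exception and the `fetch`, `handler`, and `crawl` functions are hypothetical names invented for illustration. They are neither Scrapy's nor Crawlee's API, and a real implementation would also have to cancel in-flight tasks and release resources.

```python
import asyncio


class StopCrawl(Exception):
    """Hypothetical signal raised by the user function to end the crawl."""


sent: list[str] = []  # URLs for which a request was sent


async def fetch(url: str) -> str:
    """Pretend to send a request and record that it happened."""
    sent.append(url)
    await asyncio.sleep(0)  # yield control, as a real HTTP call would
    return f"<html>{url}</html>"


async def handler(url: str, body: str) -> None:
    """User function: asks the crawler to stop once it has what it needs."""
    if url.endswith("/2"):
        raise StopCrawl("found what we needed")


async def crawl(urls: list[str]) -> str:
    """Run the toy crawl and return the reason it ended."""
    try:
        for url in urls:
            body = await fetch(url)
            await handler(url, body)
    except StopCrawl as exc:
        # A real crawler would also cancel running tasks and close resources here.
        return f"stopped: {exc}"
    return "finished"


urls = [f"https://example.com/{i}" for i in range(1, 6)]
reason = asyncio.run(crawl(urls))
print(reason, len(sent))
```

Because the stop signal reaches the crawl loop itself, no request is sent after the handler asks to stop, unlike the flag-based workaround.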

janbuchar commented 19 hours ago

This has been discussed in https://github.com/apify/crawlee-python/discussions/506