vdusek opened 1 week ago
In https://github.com/apify/crawlee-python/pull/235 we set `ensure_consistency` to `False` as a hotfix in the request queue.
The root cause of this behavior was introduced in https://github.com/apify/crawlee-python/pull/186; before that PR, the RQ worked fine, without waiting. Let's deep dive into it after the public launch and the documentation are ready.
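For context on why such a consistency flag can slow things down: a consistency check typically polls the storage with exponential backoff until it reports a consistent state, so a check that keeps failing stalls the caller for the entire backoff chain. The following is a minimal illustrative sketch, not Crawlee's actual implementation; `wait_for_consistency` and its parameters are hypothetical names chosen for this example:

```python
import asyncio
import time
from typing import Callable


async def wait_for_consistency(
    is_consistent: Callable[[], bool],
    initial_delay: float = 0.05,
    max_attempts: int = 5,
) -> bool:
    """Poll until the storage reports a consistent state.

    Backs off exponentially between attempts; returns True once the
    check passes, or the final check result after max_attempts.
    """
    delay = initial_delay
    for _ in range(max_attempts):
        if is_consistent():
            return True
        await asyncio.sleep(delay)
        delay *= 2  # exponential backoff: 0.05, 0.1, 0.2, 0.4, 0.8 s
    return is_consistent()


async def demo() -> float:
    start = time.monotonic()
    # A check that never succeeds forces the full backoff chain,
    # blocking the caller for the sum of all sleep intervals.
    await wait_for_consistency(lambda: False)
    return time.monotonic() - start


if __name__ == '__main__':
    elapsed = asyncio.run(demo())
    print(f'stalled for {elapsed:.2f}s')
```

With these example defaults the failing path sleeps for roughly 1.55 s per call, which is how a per-request consistency wait can make the whole queue feel stuck.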
Code to reproduce it:

```python
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
from crawlee.log_config import CrawleeLogFormatter

# Configure the logging
handler = logging.StreamHandler()
handler.setFormatter(CrawleeLogFormatter(include_logger_name=True))
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)


async def main() -> None:
    crawler = BeautifulSoupCrawler()
    dataset = await Dataset.open()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else '',
        }
        await dataset.push_data(data)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
The `RequestQueue` is sometimes very slow and can get stuck for a while. In the logs, there are many lines like this:
This logging comes from the following code block: `crawlee/storages/request_queue.py#L541:L546`
Questions