apify / crawlee-python

Crawlee: a web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
https://apify.github.io/crawlee-python/
Apache License 2.0

Request fetching from `RequestQueue` is sometimes very slow #203

Open · vdusek opened 1 week ago

vdusek commented 1 week ago
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Enqueue every link on the page that stays on the same hostname.
        await context.enqueue_links(strategy='same-hostname')
        data = {
            'request_url': context.request.url,
            # The soup object has no `url` attribute, so the original
            # `context.soup.url` was always None; `request.loaded_url`
            # holds the URL that was actually loaded (after redirects).
            'loaded_url': context.request.loaded_url,
            'soup_title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Questions

vdusek commented 2 days ago

In https://github.com/apify/crawlee-python/pull/235 we set `ensure_consistency` to `False` in the request queue as a hotfix.

The root cause of this behavior is in https://github.com/apify/crawlee-python/pull/186; before that change, the RQ worked fine, without any waiting. Let's dig deeper into this after the public launch, once the documentation is ready.
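
For context, the slowdown matches waiting on the queue head: before handing out a request, the head is re-checked until it looks consistent. Below is a minimal sketch of that kind of gating, assuming a `list_head` client call and an exponential backoff; the helper name and the retry scheme are illustrative, not the actual crawlee-python internals, and only the `ensure_consistency` flag comes from the PR.

import asyncio
from typing import Any

async def ensure_head_is_non_empty(
    queue: Any,
    *,
    ensure_consistency: bool = False,
    max_attempts: int = 6,
) -> None:
    # Illustrative sketch: poll the queue head until it is non-empty
    # or until consistency is not required.
    for attempt in range(max_attempts):
        head = await queue.list_head()  # hypothetical client call
        if head.items or not ensure_consistency:
            # With ensure_consistency=False (the PR #235 hotfix), any
            # head snapshot is accepted immediately, without waiting.
            return
        # With ensure_consistency=True, the head is re-polled with an
        # exponential backoff until it is believed to be consistent;
        # these sleeps are where the multi-second fetch delays come from.
        await asyncio.sleep(0.5 * 2**attempt)

Disabling the check trades consistency guarantees (presumably relevant mainly when several clients share one queue) for immediate fetches, which is why it is a hotfix rather than a final fix.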

Code to reproduce it:

import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
from crawlee.log_config import CrawleeLogFormatter

# Configure logging so Crawlee's formatter includes the logger name.
handler = logging.StreamHandler()
handler.setFormatter(CrawleeLogFormatter(include_logger_name=True))
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)

async def main() -> None:
    crawler = BeautifulSoupCrawler()
    dataset = await Dataset.open()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else '',
        }
        await dataset.push_data(data)

    await crawler.run(['https://crawlee.dev/'])

if __name__ == '__main__':
    asyncio.run(main())
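
The crawler is not strictly needed to observe the slowness; the fetch can also be timed directly against the storage API. A minimal sketch, assuming the public `RequestQueue.fetch_next_request` and `mark_request_as_handled` methods and the top-level `Request` import (older versions exposed it as `crawlee.models.Request`); the seed URLs and request count are arbitrary:

import asyncio
import time

from crawlee import Request
from crawlee.storages import RequestQueue

async def main() -> None:
    rq = await RequestQueue.open()

    # Seed the queue with a handful of unique requests.
    for i in range(10):
        await rq.add_request(Request.from_url(f'https://crawlee.dev/?page={i}'))

    # Time each fetch; when the consistency wait kicks in, a call that
    # should take milliseconds takes seconds instead.
    while True:
        start = time.perf_counter()
        request = await rq.fetch_next_request()
        elapsed = time.perf_counter() - start
        if request is None:
            break
        print(f'fetched {request.url} in {elapsed:.3f} s')
        await rq.mark_request_as_handled(request)

if __name__ == '__main__':
    asyncio.run(main())

Each iteration prints how long `fetch_next_request` took, so a stall in the head-consistency wait shows up as a multi-second outlier.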