apify / crawlee-python

Crawlee: a web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless modes, with proxy rotation.
https://apify.github.io/crawlee-python/
Apache License 2.0

Request fetching from `RequestQueue` is sometimes very slow #203

Open · vdusek opened 1 week ago

vdusek commented 1 week ago
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Enqueue every link on the page that stays on the same hostname.
        await context.enqueue_links(strategy='same-hostname')
        data = {
            'request_url': context.request.url,
            # The soup object has no `url` attribute, so the original
            # `context.soup.url` was always None; `request.loaded_url`
            # holds the URL that was actually loaded (after redirects).
            'loaded_url': context.request.loaded_url,
            'soup_title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Questions

vdusek commented 2 days ago

In https://github.com/apify/crawlee-python/pull/235 we set `ensure_consistency` to `False` in the request queue as a hotfix.

The root cause of this behavior is in https://github.com/apify/crawlee-python/pull/186; before that change, the RQ worked fine, without any waiting. Let's dig deeper into this after the public launch, once the documentation is ready.
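
For context, the slowdown matches waiting on the queue head: before handing out a request, the head is re-checked until it looks consistent. Below is a minimal sketch of that kind of gating, assuming a `list_head` client call and an exponential backoff; the helper name and the retry scheme are illustrative, not the actual crawlee-python internals, and only the `ensure_consistency` flag comes from the PR.

import asyncio
from typing import Any

async def ensure_head_is_non_empty(
    queue: Any,
    *,
    ensure_consistency: bool = False,
    max_attempts: int = 6,
) -> None:
    # Illustrative sketch: poll the queue head until it is non-empty
    # or until consistency is not required.
    for attempt in range(max_attempts):
        head = await queue.list_head()  # hypothetical client call
        if head.items or not ensure_consistency:
            # With ensure_consistency=False (the PR #235 hotfix), any
            # head snapshot is accepted immediately, without waiting.
            return
        # With ensure_consistency=True, the head is re-polled with an
        # exponential backoff until it is believed to be consistent;
        # these sleeps are where the multi-second fetch delays come from.
        await asyncio.sleep(0.5 * 2**attempt)

Disabling the check trades consistency guarantees (presumably relevant mainly when several clients share one queue) for immediate fetches, which is why it is a hotfix rather than a final fix.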

Code to reproduce it:

import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
from crawlee.log_config import CrawleeLogFormatter

# Configure logging so Crawlee's formatter includes the logger name.
handler = logging.StreamHandler()
handler.setFormatter(CrawleeLogFormatter(include_logger_name=True))
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)

async def main() -> None:
    crawler = BeautifulSoupCrawler()
    dataset = await Dataset.open()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else '',
        }
        await dataset.push_data(data)

    await crawler.run(['https://crawlee.dev/'])

if __name__ == '__main__':
    asyncio.run(main())
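
The crawler is not strictly needed to observe the slowness; the fetch can also be timed directly against the storage API. A minimal sketch, assuming the public `RequestQueue.fetch_next_request` and `mark_request_as_handled` methods and the top-level `Request` import (older versions exposed it as `crawlee.models.Request`); the seed URLs and request count are arbitrary:

import asyncio
import time

from crawlee import Request
from crawlee.storages import RequestQueue

async def main() -> None:
    rq = await RequestQueue.open()

    # Seed the queue with a handful of unique requests.
    for i in range(10):
        await rq.add_request(Request.from_url(f'https://crawlee.dev/?page={i}'))

    # Time each fetch; when the consistency wait kicks in, a call that
    # should take milliseconds takes seconds instead.
    while True:
        start = time.perf_counter()
        request = await rq.fetch_next_request()
        elapsed = time.perf_counter() - start
        if request is None:
            break
        print(f'fetched {request.url} in {elapsed:.3f} s')
        await rq.mark_request_as_handled(request)

if __name__ == '__main__':
    asyncio.run(main())

Each iteration prints how long `fetch_next_request` took, so a stall in the head-consistency wait shows up as a multi-second outlier.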