apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://apify.github.io/crawlee-python/
Apache License 2.0

Automate default logging configuration for crawlers #214

Closed vdusek closed 5 days ago

vdusek commented 6 days ago

Currently, when a user runs their crawler, no logs are printed, so users have no visibility into the crawler's progress and actions.

We already have a log formatter (CrawleeLogFormatter), but users have to configure it manually, like this:

import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.log_config import CrawleeLogFormatter

# Configure the logging manually using the Crawlee formatter
handler = logging.StreamHandler()
handler.setFormatter(CrawleeLogFormatter(include_logger_name=True))
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Enqueue every link found on the page for further crawling.
        await context.enqueue_links()
        # Extract the page URL and title and store them in the default dataset.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Can we make this setup the default? A possible solution: importing any module from Crawlee would configure the root logger automatically.
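
A minimal sketch of how that could look, assuming a hypothetical _configure_default_logging() helper invoked at import time from the package's __init__.py (the helper name and the handler-presence check are illustrative, not existing Crawlee API):

import logging

from crawlee.log_config import CrawleeLogFormatter

def _configure_default_logging() -> None:
    # Respect any logging setup the user has already put in place.
    root_logger = logging.getLogger()
    if root_logger.handlers:
        return
    # Otherwise attach the Crawlee formatter with a sensible default level.
    handler = logging.StreamHandler()
    handler.setFormatter(CrawleeLogFormatter(include_logger_name=True))
    root_logger.setLevel(logging.INFO)
    root_logger.addHandler(handler)

# Hypothetical: executed on import, e.g. from crawlee/__init__.py.
_configure_default_logging()

With a guard like this, users who already configure logging keep their setup, while everyone else gets readable output out of the box. A less invasive variant would configure only the 'crawlee' logger rather than the root logger.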