apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Crawler doesn't respect `configuration` argument #539

Open tlinhart opened 2 months ago

tlinhart commented 2 months ago

Consider this sample program:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext

async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})

async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    crawler = ParselCrawler(request_handler=default_handler, configuration=config)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)

if __name__ == "__main__":
    asyncio.run(main())

The `configuration` argument given to `ParselCrawler` is not respected; during the run it creates the `./storage` directory and persists all the (meta)data anyway. I have to work around it by overriding the global configuration like this:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext

async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})

async def main() -> None:
    config = Configuration.get_global_configuration()
    config.persist_storage = False
    config.write_metadata = False
    crawler = ParselCrawler(request_handler=default_handler)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)

if __name__ == "__main__":
    asyncio.run(main())
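
Since mutating the global configuration leaks into everything else in the process, a slightly safer variant of the workaround is to restore the original values afterwards. The helper below is plain Python, not a Crawlee API, and a `SimpleNamespace` stands in for the object returned by `Configuration.get_global_configuration()`:

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override_attrs(obj, **overrides):
    """Temporarily set attributes on obj, restoring the originals on exit."""
    saved = {name: getattr(obj, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(obj, name, value)
        yield obj
    finally:
        for name, value in saved.items():
            setattr(obj, name, value)


# Stand-in for Configuration.get_global_configuration() in this sketch.
config = SimpleNamespace(persist_storage=True, write_metadata=True)

with override_attrs(config, persist_storage=False, write_metadata=False):
    # The crawler would run here with persistence disabled.
    print(config.persist_storage, config.write_metadata)  # False False

# The global flags are back to their original values.
print(config.persist_storage, config.write_metadata)  # True True
```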

janbuchar commented 2 months ago

Hello, and thanks for the reproduction! It seems that the problem is here:

https://github.com/apify/crawlee-python/blob/master/src/crawlee/storages/_creation_management.py#L122-L132

It looks like `service_container.get_storage_client` does not consider the adjusted configuration.

Also, we have a test for this - https://github.com/apify/crawlee-python/blob/master/tests/unit/basic_crawler/test_basic_crawler.py#L630-L639 - which probably fails because we're looking inside a different storage directory than the global one.