apify / crawlee-python


How can I disable the cache completely? #369


1hachem commented 1 month ago

I am trying to write a simple function that crawls a website, and I don't want Crawlee to cache anything (each time I call this function, it should do everything from scratch).

Here is my attempt so far. I tried `persist_storage=False` and `purge_on_start=True` in the configuration, and also removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result if I delete the storage directory.

from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to this many requests in total (a request cap,
        # not a link depth). Remove or increase it to crawl all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()

        await dataset.push_data({"content": text})

        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()

    content = "\n".join([item["content"] for item in data.items])  # type: ignore

    return content

Also, is there a way to simply get the result of the crawl as a string, without using a Dataset?

Any help is appreciated 🤗 thank you in advance!
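
For the second question, a Dataset is not strictly required: since the request handler is a closure, the extracted text can be collected in a plain local list and joined at the end. A minimal sketch of that approach, reusing only the imports and calls already present in the snippet above (the function name crawl_to_string is just for illustration):

from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def crawl_to_string(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    max_requests: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=max_requests)
    pages: list[str] = []  # results live only in memory for this call

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Append the page text to the local list instead of pushing it to a Dataset.
        pages.append(context.soup.get_text())
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    await crawler.run([website])
    return "\n".join(pages)

Note that the request queue driving crawler.run() still lives in the default storage, so the purge_on_start behavior discussed below still matters for repeated calls.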

janbuchar commented 1 month ago

Hello, and thank you for your interest in Crawlee! This seems closely related to #351. Could you please re-check that you get an empty string when you run this after removing the storage directory? I can imagine getting an empty string on a second run without deleting the storage (because of both `persist_storage=False` and `purge_on_start` functioning incorrectly), but what you're describing sounds strange.
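
A related sketch for the accumulation problem: assuming Dataset.drop() removes the dataset and its data, as its name suggests, dropping the dataset after reading it should make each call to main start from an empty dataset rather than re-reading earlier results. The tail of main would then look like this:

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()

    content = "\n".join(item["content"] for item in data.items)

    # Drop the dataset so a later call opens a fresh, empty one instead of
    # appending to and re-reading the results of previous runs.
    await dataset.drop()

    return content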