apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

crawler fail on second run (env CRAWLEE_PURGE_ON_START not used?) #1602

Open terchris opened 2 years ago

terchris commented 2 years ago

Which package is this bug report for? If unsure which one to select, leave blank

No response

Issue description

I suspect that client.purge is not run when Crawlee starts, as described in the docs. Setting CRAWLEE_PURGE_ON_START=true or false has no effect: all the files are still in key_value_stores/default/, e.g. SDK_CRAWLER_STATISTICS_12.json. I have set CRAWLEE_STORAGE_DIR="tmpfilesystem/crawlee", so it might be related.

If I delete the directory tmpfilesystem/crawlee and run the code below, it works just fine: the website is scraped and its title is displayed. The second time the code is run, it does not work. If I delete all the files and try again, it works.

This is debugging from the first run:

DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: deb916eb-2112-4a46-9e63-80c90cdccd1c
INFO  CheerioCrawler: Starting the crawl
DEBUG CheerioCrawler:SessionPool: Created new Session - session_4ErkOlXe8x
INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1297,"requestsFinishedPerMinute":44,"requestsFailedPerMinute":0,"requestTotalDurationMillis":1297,"requestsTotal":1,"crawlerRuntimeMillis":1368}
crawlee test passed. can webscrape website: https://www.smartebyernorge.no Title=Smarte Byer Norge

This is debugging from the second run:

DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: deb916eb-2112-4a46-9e63-80c90cdccd1c
DEBUG CheerioCrawler:SessionPool: Recreating state from KeyValueStore {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG CheerioCrawler:SessionPool: 1 active sessions loaded from KeyValueStore
INFO  CheerioCrawler: Starting the crawl
INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_1"}
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":55}

Code sample

/*
.env
CRAWLEE_STORAGE_DIR="tmpfilesystem/crawlee"
CRAWLEE_MEMORY_MBYTES=2048
CRAWLEE_LOG_LEVEL=DEBUG

CRAWLEE_PURGE_ON_START=true

*/

import { CheerioCrawler } from 'crawlee';

async function crawlee_test() {

    let testResult = {
        testMessage: "",
        testPassed: false
    };

    let returnString = "crawlee test";
    let websiteListToCrawl = ["https://www.smartebyernorge.no"];

    const crawler = new CheerioCrawler({

        minConcurrency: 10,
        maxConcurrency: 50,

        // On error, retry each page at most once.
        maxRequestRetries: 1,

        // Increase the timeout for processing of each page.
        requestHandlerTimeoutSecs: 30,

        // Limit to 10 requests per one crawl
        maxRequestsPerCrawl: 10,

        async requestHandler({ request, $ }) {
            //console.log(`Processing ${request.url}...`);

            // Extract data from the page using cheerio.
            const title = $('title').text();
            //let pageH1 = $('h1').text().trim();
            //let pageP1 = $('p').text().trim();

            returnString = returnString + " passed. can webscrape website: " + request.url + " Title=" + title;
            testResult.testMessage = returnString;
            testResult.testPassed = true;

        },

        // This function is called if the page processing failed more than maxRequestRetries + 1 times.
        failedRequestHandler({ request }) {

            returnString = returnString + " Failed. can NOT webscrape website: " + request.url;
            testResult.testMessage = returnString;
            testResult.testPassed = false;

        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run(websiteListToCrawl);

    return testResult;

}

async function do_test() {

    let testResult = {
        testMessage: "",
        testPassed: false
    };

    testResult = await crawlee_test();
    // wait a minute before next test
    await new Promise(resolve => setTimeout(resolve, 60000));
    testResult = await crawlee_test();    
}

Package version

crawlee@3.1.0

Node.js version

v16.17.0

Operating system

macOS

Apify platform

Priority this issue should have

Medium (should be fixed soon)

I have tested this on the next release

No response

Other context

No response

xialer commented 1 year ago

I encountered the same problem: if the crawler is run twice, the second run's requestHandler never responds.

Then I realized that the requestQueue was probably cached.

So I tried adding await requestQueue.drop(); at the end to clear the cache, and it seems to work.
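
A minimal sketch of that workaround, assuming the default request queue is opened explicitly and passed to the crawler (the URL is the one from the original report):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open the default request queue explicitly so it can be dropped afterwards.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://www.smartebyernorge.no']);

// Drop the queue so its requests are not treated as already handled on the next run.
await requestQueue.drop();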

chris2k2 commented 1 year ago

I have a related issue in crawlee 3.3.3. The datasets do indeed seem not to be purged by PURGE_ON_START or purgeDefaultStorages. I wrote this small test to confirm it:

it("shows that purgeDefaultstorage doesn't do anything?", async () => {
        let crawler = new CheerioCrawler({
            async requestHandler({}) {
                await Dataset.pushData({item: "asdf"});
            }
        });
        await crawler.run(["http://www.google.de"]);
        await purgeDefaultStorages();
        await crawler.run(["http://www.bing.de"]);

        expect((await Dataset.getData()).count).to.be.eq(1);
    });

So I believe it's indeed a bug and the purge commands don't seem to work. However, I believe @terchris is running into the problem that the request queue needs to be dropped, as @xialer pointed out.

ehsaaniqbal commented 1 year ago

It still seems to be an issue. Any updates?

B4nan commented 1 year ago

Purging works as expected. The problem here is that this is a rather internal API that is not supposed to work the way you are trying to use it. Crawlee purges the default storages automatically (in other words, PURGE_ON_START defaults to true), and this is supposed to happen only once: the purgeDefaultStorages method is called multiple times from various places, and we want it to execute only once (on the very first call).

I guess we could rework this a bit to support explicit purging too. For now you can try this:

const config = Configuration.getGlobalConfig();
const client = config.getStorageClient();
await client.purge?.();
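
For example, purging explicitly between two runs in the same process might look roughly like this (a sketch reusing the URL from the original report; the crawler options are only illustrative):

import { CheerioCrawler, Configuration } from 'crawlee';

async function runOnce(url) {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    });
    await crawler.run([url]);
}

await runOnce('https://www.smartebyernorge.no');

// The automatic purge happens only once per process, so purge explicitly
// before crawling the same site again.
const client = Configuration.getGlobalConfig().getStorageClient();
await client.purge?.();

await runOnce('https://www.smartebyernorge.no');
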
germanattanasio commented 1 year ago

I have the same problem reported here and have tried the solution you proposed, @B4nan, but no luck.

import { PlaywrightCrawler, Configuration, EnqueueStrategy, log } from 'crawlee';

const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
  const crawler = new PlaywrightCrawler({
    launchContext: { launchOptions: { headless: true } },
    maxRequestRetries: 1,
    requestHandlerTimeoutSecs: 20,
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, page, enqueueLinks }) {
      try {
        const html = await page.evaluate('document.body.innerHTML');
        // Publish this html
        onDocument(html);

        // If the page is part of a seed, visit the links
        await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
      } catch (err) {
        log.warn('Error processing url: ' + request.url);
      }
    },
  });

  await crawler.addRequests([seedUrl]);
  await crawler.run();

  try {
    const config = Configuration.getGlobalConfig();
    const client = config.getStorageClient();
    await client.purge?.();
  } catch (err) {
    log.warn('Failed to purge storage client');
  }
}

Running this fails because the second onDocument is never called. The page was already crawled.

test("crawl multiple URLs", async () => {
   const onDocument = jest.fn();

   await crawlPage("https://moveo.ai", onDocument);
   expect(onDocument).toHaveBeenCalled();

   const onDocumentSecond = jest.fn();
   await crawlPage("https://moveo.ai", onDocument);
   expect(onDocumentSecond).toHaveBeenCalled();
});
germanattanasio commented 1 year ago

I actually get an error where the purge() is trying to delete a file.

Could not find file at /storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json
B4nan commented 1 year ago

Running this fails because the second onDocument is never called. The page was already crawled.

That test seems to be wrong: you are not passing onDocumentSecond as the second argument, so it won't be called. It should be this instead:

test("crawl multiple URLs", async () => {
   const onDocument = jest.fn();
   await crawlPage("https://moveo.ai/", onDocument);
   expect(onDocument).toHaveBeenCalled();

   const onDocumentSecond = jest.fn();
   await crawlPage("https://moveo.ai/", onDocumentSecond); // <--
   expect(onDocumentSecond).toHaveBeenCalled();
});

I actually get an error where the purge() is trying to delete a file.

And do you get that from actual usage, or in a test case? Are you calling that method in parallel?

germanattanasio commented 1 year ago

UPDATE: @B4nan, you are right; in my attempt to clean up the code to paste here I made a typo. I actually check onDocumentSecond in my second expect().

The first time the method runs, it finds multiple pages, so onDocument is called around 22 times (maxRequestsPerCrawl=20).

The second time, the new mocked function onDocumentSecond isn't called because some state from the first run is stored somewhere, possibly in a variable within a module. If we had a teardown(), purge, or similar method to clean up the entire state, I believe this code would function properly.

I've tried various alternatives that I found in several issues similar to this one. I'm currently documenting them and planning to initiate a discussion with my findings.

B4nan commented 1 year ago

Are you sure you are using the latest version? The run method itself already does the necessary cleanup:

https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L648-L655

germanattanasio commented 1 year ago

Yeah. I'm using 3.4.0

joesamcoke commented 1 year ago

I'm having the same problem in 3.4.2

germanattanasio commented 1 year ago

For future reference: I solved the problem using the persistStorage: false configuration. You need to set it each time you instantiate a PlaywrightCrawler instance.

import { PlaywrightCrawler, Configuration, EnqueueStrategy, log } from 'crawlee';

const crawlPage = async (seedUrl: string, onDocument: (html: string) => void) => {
  const crawler = new PlaywrightCrawler({
    launchContext: { launchOptions: { headless: true } },
    maxRequestRetries: 1,
    requestHandlerTimeoutSecs: 20,
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, page, enqueueLinks }) {
      try {
        const html = await page.evaluate('document.body.innerHTML');
        // Publish this html
        onDocument(html);

        // If the page is part of a seed, visit the links
        await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });
      } catch (err) {
        log.warn('Error processing url: ' + request.url);
      }
    },
  }, 
  new Configuration({ persistStorage: false })); // <---- Configuration

  await crawler.addRequests([seedUrl]);
  await crawler.run();

}
chris2k2 commented 10 months ago
    it("shows that purgeDefaultstorage doesn't do anything?", async () => {
        let crawler = new CheerioCrawler({
                async requestHandler({}) {
                    await Dataset.pushData({item: "asdf"});
                }
            },
            new Configuration({persistStorage: false})
        );
        await crawler.run(["http://www.google.de"]);
        await purgeDefaultStorages();
        await crawler.run(["http://www.google.de"]);

        expect((await Dataset.getData()).count).to.be.eq(1);
    });

I tried this, @germanattanasio, but this unit test still fails. Where am I wrong?

B4nan commented 10 months ago

    expect((await Dataset.getData()).count).to.be.eq(1);

This call will use the global config, and therefore the same storage. You have three options:

  1. Instead of using a local config instance, modify the global one via Configuration.set() (or use env vars, namely https://crawlee.dev/docs/guides/configuration#crawlee_purge_on_start).
  2. Create a dataset instance that respects your local config via Dataset.open(null, { config }) and call getData() on that.
  3. Use crawler.getData(), which respects the config you pass to the crawler (see the sketch below).
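
For example, options 2 and 3 might look roughly like this (a minimal sketch, assuming a local Configuration with persistStorage: false as in the test above):

import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

const config = new Configuration({ persistStorage: false });

const crawler = new CheerioCrawler({
    async requestHandler({ pushData }) {
        // pushData on the crawling context stores into the crawler's default dataset
        await pushData({ item: "asdf" });
    },
}, config);

await crawler.run(["http://www.google.de"]);

// Option 2: open the default dataset with the same local config,
// so it reads the storage the crawler actually wrote to.
const dataset = await Dataset.open(null, { config });
console.log((await dataset.getData()).count); // expected: 1

// Option 3: crawler.getData() resolves the dataset via the crawler's config.
console.log((await crawler.getData()).count); // expected: 1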