apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Requests are processed multiple times #1216

Closed · rubydev closed this issue 2 years ago

rubydev commented 3 years ago

Describe the bug
Lately I've noticed that in two different Puppeteer actors with maxConcurrency set to around 80-100, the dataset sometimes contains many more results than there are handled requests. For example, 19.4k results but only 18.1k handled requests, which should not be possible. Some requests only browse categories, and each of the remaining requests should produce exactly one result, so there should be noticeably fewer results than requests. Are we sure that some requests are not being handled multiple times?

For product detail requests, the uniqueKey is set to the productId. Of the 19.4k results, only 14k are unique products (14k sounds right). So either requests are being processed multiple times, or results were added to the dataset multiple times. I didn't find anything useful in the log file.

Expected behavior
Requests should not be handled multiple times.

System information:

mnmkng commented 3 years ago

I don't think it's possible for requests to be randomly processed multiple times; at least we haven't heard about it from anyone else. I think it's caused by duplicate results in the dataset. This used to happen when the dataset API was overloaded. We fixed it then, but I guess it's still happening. cc @mtrunkat

Anyway, this is most likely not an issue with the SDK but with the Apify platform. Closing, but feel free to reopen if you can provide a reproduction.

rubydev commented 3 years ago

@mnmkng Thanks.

I looked at the logs again and there is maybe something weird. I'm not sure why there are also handle request timeouts like this: Error: handleRequestFunction timed out after 110 seconds. The code only sets handlePageTimeoutSecs: 40, and PuppeteerCrawler doesn't have a handleRequestFunction, so why do these timeouts appear? I'm also not sure where the number 110 comes from. See screenshot:

[screenshot]

Edit: In another actor, where handlePageTimeoutSecs: 30 is set, the errors say Error: handleRequestFunction timed out after 100 seconds. So in one actor it's 110, and in the other 100.

mnmkng commented 3 years ago

TL;DR: the hotfix is to increase either navigationTimeoutSecs or handlePageTimeoutSecs.

A default handleRequestTimeout is always there to prevent the actor from hanging indefinitely when a process outside of the handlePageFunction hangs.

It works like this:

```js
handleRequestFunction() {
  // create new page
  // preNavigationHooks
  // navigation (page.goto)
  // postNavigationHooks
  handlePageFunction()
  // some cleanup
}
```

The timeout should almost never happen, but it needs to be there to prevent hanging. It's calculated like this:

```
handleRequestTimeoutSecs = navigationTimeoutSecs + handlePageTimeoutSecs + 10
```

So there's probably some activity running outside of page.goto or handlePage that takes more than 10 seconds. Maybe you're solving the captchas in a preNavigationHook?
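
As a concrete illustration of the hotfix, here is a minimal sketch of a PuppeteerCrawler configured with larger budgets. The option names match the Apify SDK v2 API used elsewhere in this thread; the URL and the specific values are placeholders:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // Raising either value raises the derived total:
        // handleRequestTimeoutSecs = navigationTimeoutSecs + handlePageTimeoutSecs + 10
        navigationTimeoutSecs: 60,   // budget for page.goto
        handlePageTimeoutSecs: 120,  // budget for the user-supplied handlePageFunction
        handlePageFunction: async ({ page, request }) => {
            // per-request scraping logic goes here
        },
    });

    await crawler.run();
});
```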

mnmkng commented 3 years ago

We now have 2 timeouts + buffer, but they don't cover pre|postNavigationHooks. I don't know what a good solution would be here.

  1. add another timeout for the hooks; seems like overkill
  2. increase the buffer constant; seems wrong
  3. make handleRequestTimeout configurable from higher level crawlers, now it's hardcoded; dumb but maybe the easiest solution
  4. include the hooks in the navigation timeout; does it make sense?

cc @B4nan cc @szmarczak because you were looking at the other timeouts

rubydev commented 3 years ago

@mnmkng Thank you for explanation.

mnmkng commented 3 years ago
> Is `postLaunchHook` also included in `handleRequestFunction`?

A postLaunchHook runs in the // create new page part of handleRequestFunction, so yes, it's included there.

> So this is not related to the duplicated results (requests processed multiple times), right?

It could be, because if the "pre-processing" takes too long, the handleRequestFunction timeout might fire before the handlePageFunction or further processing completes, and the request will be retried. But in the background, the handlePageFunction could still be running, because there's no way to cancel a running function.

Imagine this:

```
navigationTimeoutSecs: 30
handlePageTimeoutSecs: 30
handleRequestTimeoutSecs: 30 + 30 + 10 = 70
```

Since postLaunchHooks are invoked before the first two timeouts even start, if they take 80 seconds, the handleRequestFunction will throw and the request will be retried, but the handlePageFunction will keep running, leading to the request being processed twice.

So this might be a bug in the end, not just a missing timeout feature. 🤯
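
The race can be illustrated without the SDK. This is not the SDK's actual implementation, just a minimal sketch of why the losing side of a timeout race keeps executing: JavaScript promises cannot be cancelled, so rejecting the outer call does not stop the inner work.

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject after `secs` seconds; note the raced promise itself keeps running.
const addTimeout = (promise, secs) =>
    Promise.race([
        promise,
        sleep(secs * 1000).then(() => {
            throw new Error(`handleRequestFunction timed out after ${secs} seconds.`);
        }),
    ]);

// Stands in for a handleRequestFunction with a slow postLaunchHook inside.
const handleRequest = async () => {
    await sleep(80 * 1000);
    console.log('handlePageFunction work still happens after the timeout fired');
};

// The timeout fires at 70 seconds and the request gets retried...
addTimeout(handleRequest(), 70).catch((err) => console.error(err.message));
// ...but 10 seconds later the log line above still prints: double processing.
```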

mnmkng commented 3 years ago

Related, perhaps a solution: https://github.com/apify/apify-js/issues/1102

rubydev commented 3 years ago

Yes, in the postLaunch hook we open one page (the homepage) and solve the captcha, and it can take a long time. Just opening the page has a 20-second timeout, plus the captcha solving:

```js
module.exports.postLaunchHook = async (pageId, browserController, input) => {
    const { storeId, zipCode } = input;
    const { browser } = browserController;
    const page = await browser.newPage();

    await Apify.utils.puppeteer.blockRequests(page, { urlPatterns: BLOCK_PATTERNS });

    try {
        await page.goto('https://www.walmart.com', { timeout: 20 * 1000 }); // set initial cookies
        await this.unblockSession({ page }); // try solve captcha
        await locationTools.prepareLocation(page, storeId, zipCode); // set location
        await page.close();
    } catch (e) {
        log.warning(`Cookies not set: ${e}`);
    }
};
```

It's in the postLaunch hook because we can't open any other request until we've opened the Walmart homepage, but we don't want to open it for every request; once per browser is enough.

rubydev commented 3 years ago

In the second actor we use hooks too, but these hooks shouldn't take any time, and the handleRequestFunction timeouts still occur.

preNavigationHook:

```js
exports.preNavigationHook = async ({ page, request, session, maxRequestRetries, maxRootRequestRetries }) => {
    if (request.userData.initRequest && request.retryCount > 0) {
        if (request.userData.rootRetryCount < (maxRootRequestRetries - maxRequestRetries)) {
            request.userData.rootRetryCount += 1;
            request.retryCount = 0; // reset retry count for root page
        }

        log.warning(`Session retired - root page failed (attempt ${request.userData.rootRetryCount + request.retryCount})`);
        session.retire();
    }

    await Apify.utils.puppeteer.blockRequests(page, { urlPatterns: ['.png', '.jpg', 'suggestions', 'is-appupsell-eligible', 'dmusic'] });
};
```

preLaunchHook:

```js
exports.preLaunchHook = async (pageId, launchContext) => {
    launchContext.launchOptions.defaultViewport = {
        width: 1024 + Math.floor(Math.random() * 900),
        height: 768 + Math.floor(Math.random() * 300),
    };
    launchContext.launchOptions.args = ['--no-sandbox'];
    launchContext.userAgent = randomUA.generate();
};
```

For this actor I increased handlePageTimeoutSecs from 30 to 60 seconds and it didn't help. Now handleRequestFunction times out after 130 seconds (before it was 100): Error: handleRequestFunction timed out after 130 seconds.

mnmkng commented 3 years ago

Hmm, that's weird. Maybe it's getting completely stuck somewhere. Please send me the run links on Slack.

rubydev commented 3 years ago

@mnmkng I made some changes and it looks better. I also increased handlePageTimeoutSecs to 120 seconds as a temporary solution.

> We now have 2 timeouts + buffer, but they don't cover pre|postNavigationHooks. I don't know what a good solution would be here.
>
> 1. add another timeout for the hooks; seems like overkill
> 2. increase the buffer constant; seems wrong
> 3. make handleRequestTimeout configurable from higher level crawlers, now it's hardcoded; dumb but maybe the easiest solution
> 4. include the hooks in the navigation timeout; does it make sense?
>
> cc @B4nan cc @szmarczak because you were looking at the other timeouts

My take on the options:

1. This is the best one, and handleRequestTimeout should be the total across all the steps.
2. Wrong; nobody can know how much time will be needed in different cases.
3. Also wrong. Even with a bigger number, handleRequestFunction can still time out while the inner handlePageFunction is still in progress if the process gets stuck somewhere. The current issue would just happen less often, so it doesn't look like a proper solution.
4. Maybe, for pre|postNavigationHooks, but postLaunchHook, for example, isn't related to navigation, so its timeout should be separate.

Also, I think that postLaunchHook, which is called only once after the browser is opened, should not be part of handleRequestFunction, which is called for each request. If I understand it correctly :) Maybe the browser pool should allow setting a timeout for each hook.
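
A userland approximation of that idea is sketched below: bound the hook's own work with a timeout so it fails fast (and the browser launch fails) instead of silently eating the handleRequest budget. withTimeout and warmUp are illustrative names, the 90-second budget is arbitrary, and the same cannot-cancel caveat applies to the losing promise:

```js
// Illustrative helper: resolve with the hook's work, or reject after `secs`.
const withTimeout = (promise, secs, label) =>
    Promise.race([
        promise,
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error(`${label} timed out after ${secs}s`)), secs * 1000)),
    ]);

// Stand-in for the homepage + captcha + location steps from the hook above.
const warmUp = async (page, input) => {
    await page.goto('https://www.walmart.com', { timeout: 20 * 1000 });
    // ...solve captcha, set location...
};

module.exports.postLaunchHook = async (pageId, browserController, input) => {
    const { browser } = browserController;
    const page = await browser.newPage();
    try {
        await withTimeout(warmUp(page, input), 90, 'postLaunchHook');
    } finally {
        await page.close(); // always release the warm-up page
    }
};
```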

mnmkng commented 3 years ago

Reproduction case for @B4nan:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.iana.org/' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        navigationTimeoutSecs: 10,
        handlePageTimeoutSecs: 10,
        browserPoolOptions: {
            preLaunchHooks: [
                async () => {
                    // Do some async work that's longer than
                    // navTimeout + pageTimeout + buffer constant
                    //  10 + 10 + 10
                    await Apify.utils.sleep(40 * 1000);
                },
            ],
        },
        // This will behave similarly, but worse,
        // because launchHooks fire only once per browser,
        // whereas this will cause the issue to happen
        // request.retryCount times.
        // preNavigationHooks: [
        //     async () => {
        //         await Apify.utils.sleep(40 * 1000);
        //     },
        // ],
        handlePageFunction: async ({ request }) => {
            console.log('This should not be logged, because the request should not be running!', request.retryCount);
        },
    });

    await crawler.run();
});
```

The log below shows that the request was reclaimed and retried, but handlePageFunction still ran, which means any work would be done twice, leading to duplicate results and whatnot.

```
INFO  System info {"apifyVersion":"2.0.7","apifyClientVersion":"1.4.2","osType":"Darwin","nodeVersion":"v16.11.1"}
INFO  PuppeteerCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.iana.org/","retryCount":1,"id":"e2Hd517QWfF4tVh"}
  Error: handleRequestFunction timed out after 30 seconds.
      at Timeout._onTimeout (/Users/mnmkng/Projects/Apify/Actors/bug-test/node_modules/apify/build/utils.js:314:49)
      at listOnTimeout (node:internal/timers:557:17)
      at processTimers (node:internal/timers:500:7)
This should not be logged, because the request should not be running! 1
INFO  PuppeteerCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  PuppeteerCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[null,1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":9732,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9732,"requestsTotal":1,"crawlerRuntimeMillis":43190}
```

szmarczak commented 3 years ago

I'm getting a different result. It says that all requests have been processed, but the process is still going on.

logs:

```
szm@pop-os:~/Desktop/demo$ node index.js
INFO  System info {"apifyVersion":"2.1.0","apifyClientVersion":"2.0.2","osType":"Linux","nodeVersion":"v16.13.0"}
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/home/szm/Desktop/demo/apify_storage"
WARN  The following Request queue directory contains a previous state: /home/szm/Desktop/demo/apify_storage/request_queues/default
      If you did not intend to persist the state - please clear the respective directory and re-start the actor.
INFO  PuppeteerCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 3998 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.iana.org/","retryCount":1,"id":"e2Hd517QWfF4tVh"}
  Error: handleRequestFunction timed out after 30 seconds.
      at Timeout._onTimeout (/home/szm/Desktop/demo/node_modules/apify/build/utils.js:350:49)
      at listOnTimeout (node:internal/timers:557:17)
      at processTimers (node:internal/timers:500:7)
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60038,"retryHistogram":[]}
INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.iana.org/","retryCount":2,"id":"e2Hd517QWfF4tVh"}
  Error: handleRequestFunction timed out after 30 seconds.
      at Timeout._onTimeout (/home/szm/Desktop/demo/node_modules/apify/build/utils.js:350:49)
      at listOnTimeout (node:internal/timers:557:17)
      at processTimers (node:internal/timers:500:7)
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.iana.org/","retryCount":3,"id":"e2Hd517QWfF4tVh"}
  Error: handleRequestFunction timed out after 30 seconds.
      at Timeout._onTimeout (/home/szm/Desktop/demo/node_modules/apify/build/utils.js:350:49)
      at listOnTimeout (node:internal/timers:557:17)
      at processTimers (node:internal/timers:500:7)
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120050,"retryHistogram":[]}
INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
ERROR PuppeteerCrawler: Request failed and reached maximum retries {"id":"e2Hd517QWfF4tVh","url":"https://www.iana.org/","method":"GET","uniqueKey":"https://www.iana.org"}
  Error: handleRequestFunction timed out after 30 seconds.
      at Timeout._onTimeout (/home/szm/Desktop/demo/node_modules/apify/build/utils.js:350:49)
      at listOnTimeout (node:internal/timers:557:17)
      at processTimers (node:internal/timers:500:7)
INFO  PuppeteerCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":30009,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":30009,"requestsTotal":1,"crawlerRuntimeMillis":180052,"retryHistogram":[null,null,null,1]}
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":30009,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":30009,"requestsTotal":1,"crawlerRuntimeMillis":240052,"retryHistogram":[null,null,null,1]}
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":30009,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":30009,"requestsTotal":1,"crawlerRuntimeMillis":300059,"retryHistogram":[null,null,null,1]}
INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":30009,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":30009,"requestsTotal":1,"crawlerRuntimeMillis":360070,"retryHistogram":[null,null,null,1]}
```

szmarczak commented 3 years ago

Hm, that was probably the 0-concurrency bug. I'm getting the "This should not be logged" message now.

rubydev commented 2 years ago

@B4nan @mnmkng Hi, I would like to ask: when will this fix be released?