Closed rubydev closed 2 years ago
I don't think it's possible for requests to be randomly processed multiple times; at least we haven't heard about it from anyone else. I think it's caused by duplicate results in the dataset. This used to happen when the dataset API was overloaded. We fixed that, but I guess it's still happening. cc @mtrunkat
Anyway, this is most likely not an issue with the SDK, but the Apify platform. Closing. But feel free to reopen if you can provide a reproduction.
@mnmkng Thanks.
I looked at the logs again and there is maybe something weird. I'm not sure why there are also "handle request timeouts" like this: Error: handleRequestFunction timed out after 110 seconds. The code only sets `handlePageTimeoutSecs: 40`, and PuppeteerCrawler doesn't have a `handleRequestFunction` option, so why are these timeouts there? I'm also not sure where the number 110 comes from. See screenshot:
Edit: In another actor, where `handlePageTimeoutSecs: 30` is set, the errors say Error: handleRequestFunction timed out after 100 seconds. So in one actor it is 110 and in the other 100.
TLDR; the hotfix is to increase either `navigationTimeoutSecs` or `handlePageTimeoutSecs`.
A default `handleRequestTimeout` is always there to prevent the actor from hanging indefinitely when a process outside of the `handlePageFunction` would hang.
It works like below.
handleRequestFunction() {
    // create new page
    // preNavigationHooks
    // navigation (page.goto)
    // postNavigationHooks
    handlePageFunction()
    // some cleanup
}
The timeout should almost never happen, but it needs to be there to prevent hanging. It's calculated like this:
handleRequestTimeoutSecs: navigationTimeoutSecs + handlePageTimeoutSecs + 10
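For illustration, here is a rough sketch of that calculation (assuming the SDK default of 60 seconds for `navigationTimeoutSecs` and `handlePageTimeoutSecs` when they are not set explicitly), which would also explain the 110 and 100 second values above:

// Sketch of the derivation, not the actual SDK source.
// The 60-second defaults are an assumption when the option isn't set explicitly.
const TIMEOUT_BUFFER_SECS = 10;

function handleRequestTimeoutSecs({ navigationTimeoutSecs = 60, handlePageTimeoutSecs = 60 } = {}) {
    return navigationTimeoutSecs + handlePageTimeoutSecs + TIMEOUT_BUFFER_SECS;
}

console.log(handleRequestTimeoutSecs({ handlePageTimeoutSecs: 40 })); // 110, as in the first actor
console.log(handleRequestTimeoutSecs({ handlePageTimeoutSecs: 30 })); // 100, as in the second actor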
So there's probably some activity that is running outside of `page.goto` or `handlePageFunction` and takes more than 10 secs. Maybe you're solving the captchas in a `preNavigationHook`?
We now have 2 timeouts + buffer, but they don't cover pre|postNavigationHooks. I don't know what a good solution would be here.
- add another timeout for the hooks; seems like overkill
- increase the buffer constant; seems wrong
- make handleRequestTimeout configurable from higher level crawlers, now it's hardcoded; dumb but maybe the easiest solution
- include the hooks in the navigation timeout; does it make sense?
cc @B4nan cc @szmarczak because you were looking at the other timeouts
@mnmkng Thank you for the explanation.
* Is `postLaunchHook` also included in `handleRequestFunction`?
* So this is not related to the duplicated results (requests processed multiple times), right?
A `postLaunchHook` runs in the `// create new page` part of `handleRequestFunction`, so yes, it's included there.
As for the duplicated results, it could be related, because if the "pre-processing" takes too long, the `handleRequestFunction` timeout might fire before the `handlePageFunction` or further processing completes, and the request will then be retried. But in the background the `handlePageFunction` could still be running, because there's no way to cancel a running function.
Imagine this:
navigationTimeoutSecs: 30
handlePageTimeoutSecs: 30
handleRequestTimeoutSecs: 30 + 30 + 10 = 70
Since `postLaunchHooks` are invoked before the first two timeouts, if they take 80 seconds, the `handleRequestFunction` will throw and the request will be retried, but the `handlePageFunction` will continue, leading to a double request.
So this might be a bug in the end, not just a missing timeout feature. 🤯
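To illustrate why the running function cannot simply be cancelled when the outer timeout fires, here is a minimal standalone sketch in plain Node.js (not SDK code) of the same race: the timeout wins, the request gets retried, but the slow work still runs to completion.

// Minimal standalone sketch (plain Node.js, not SDK code).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function slowWork() {
    // stands in for postLaunchHooks + navigation + handlePageFunction taking ~80s
    await sleep(80 * 1000);
    console.log('handlePageFunction still finished and would push a duplicate result');
}

async function handleRequest() {
    const timeout = sleep(70 * 1000).then(() => {
        // stands in for handleRequestTimeoutSecs = 30 + 30 + 10 = 70
        throw new Error('handleRequestFunction timed out after 70 seconds.');
    });
    // The race rejects after 70 seconds, but nothing stops slowWork();
    // it keeps running while the request is reclaimed and retried.
    await Promise.race([slowWork(), timeout]);
}

handleRequest().catch((err) => console.log(`${err.message} Reclaiming request for a retry...`));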
Related, perhaps a solution: https://github.com/apify/apify-js/issues/1102
Yes, in the postLaunch hook we open one page (the homepage) and solve the captcha there, and it can take long. Just opening the page has a 20-second timeout, plus the captcha solving:
const Apify = require('apify');

const { log } = Apify.utils;

// BLOCK_PATTERNS, locationTools and unblockSession are defined elsewhere in the actor.
module.exports.postLaunchHook = async (pageId, browserController, input) => {
    const { storeId, zipCode } = input;
    const { browser } = browserController;
    const page = await browser.newPage();
    await Apify.utils.puppeteer.blockRequests(page, { urlPatterns: BLOCK_PATTERNS });
    try {
        await page.goto('https://www.walmart.com', { timeout: 20 * 1000 }); // set initial cookies
        await this.unblockSession({ page }); // try to solve the captcha
        await locationTools.prepareLocation(page, storeId, zipCode); // set location
        await page.close();
    } catch (e) {
        log.warning(`Cookies not set: ${e}`);
    }
};
It is in the postLaunch hook because we are not able to open any requests until we open the Walmart homepage first. But we don't want to open it for every request; once per browser is enough.
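A hook like this is typically registered on the crawler via browserPoolOptions.postLaunchHooks; below is only a sketch of such wiring (requestQueue, handlePageFunction and input are assumed to exist in scope, and the extra input argument is assumed to be supplied from a closure, since browser-pool passes only pageId and browserController to post-launch hooks).

// Sketch only, not the actual actor code.
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction,
    browserPoolOptions: {
        postLaunchHooks: [
            // browser-pool calls post-launch hooks with (pageId, browserController);
            // the actor input is assumed to be captured in this closure
            async (pageId, browserController) => postLaunchHook(pageId, browserController, input),
        ],
    },
});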
In the second actor we use hooks too, but those hooks should not take any time, and yet the handleRequestFunction timeouts still occur.
preNavigationHook:
exports.preNavigationHook = async ({ page, request, session, maxRequestRetries, maxRootRequestRetries }) => {
    if (request.userData.initRequest && request.retryCount > 0) {
        if (request.userData.rootRetryCount < (maxRootRequestRetries - maxRequestRetries)) {
            request.userData.rootRetryCount += 1;
            request.retryCount = 0; // reset retry count for root page
        }
        log.warning(`Session retired - root page failed (attempt ${request.userData.rootRetryCount + request.retryCount})`);
        session.retire();
    }
    await Apify.utils.puppeteer.blockRequests(page, { urlPatterns: ['.png', '.jpg', 'suggestions', 'is-appupsell-eligible', 'dmusic'] });
};
preLaunchHook:
exports.preLaunchHook = async (pageId, launchContext) => {
    launchContext.launchOptions.defaultViewport = {
        width: 1024 + Math.floor(Math.random() * 900),
        height: 768 + Math.floor(Math.random() * 300),
    };
    launchContext.launchOptions.args = ['--no-sandbox'];
    launchContext.userAgent = randomUA.generate();
};
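For completeness, a sketch of how these two hooks might be attached; this is an assumption, not the actual actor code. preNavigationHooks is a crawler option whose hooks receive the crawling context plus gotoOptions, preLaunchHooks lives under browserPoolOptions, and maxRequestRetries / maxRootRequestRetries are assumed to come from the actor input.

// Sketch only, not the actual actor code.
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    handlePageFunction,
    preNavigationHooks: [
        // merge the custom retry limits into the crawling context the hook destructures
        async (crawlingContext) => preNavigationHook({ ...crawlingContext, maxRequestRetries, maxRootRequestRetries }),
    ],
    browserPoolOptions: {
        preLaunchHooks: [preLaunchHook], // called with (pageId, launchContext)
    },
});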
For this actor I increased handlePageTimeoutSecs from 30 to 60 seconds and it did not help. Now handleRequestFunction times out after 130 seconds (before it was 100 seconds):
Error: handleRequestFunction timed out after 130 seconds.
Hmm, that's weird. Maybe it's getting completely stuck somewhere. Please send me the run links on Slack.
@mnmkng
I made some changes and it looks better. I also increased handlePageTimeoutSecs to 120 seconds as a temporary solution.
We now have 2 timeouts + buffer, but they don't cover pre|postNavigationHooks. I don't know what a good solution would be here.
- add another timeout for the hooks; seems like overkill
- increase the buffer constant; seems wrong
- make handleRequestTimeout configurable from higher level crawlers, now it's hardcoded; dumb but maybe the easiest solution
- include the hooks in the navigation timeout; does it make sense?
cc @B4nan cc @szmarczak because you were looking at the other timeouts
I think that:
1.) is best; handleRequestTimeout should be the total of the timeouts for all the steps.
2.) is wrong; nobody can know how much time will be needed in different cases.
3.) is also wrong. Even if we set a bigger number, handleRequestFunction can still time out while the nested handlePageFunction is still in progress if the process gets stuck somewhere. So the current issue would happen less often, but it doesn't look like a proper solution.
4.) For pre|postNavigationHooks maybe yes, but for example postLaunchHook isn't related to navigation, so its timeout should be separate.
Also I think that postLaunchHook, which is called only once after the browser is opened, should not be part of handleRequestFunction, which is called for each request. If I understand it correctly :) Maybe browser pool should be able to set a timeout for each hook.
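As a rough illustration of that last idea, each hook could be wrapped with its own deadline before being registered. This is only a sketch with a hypothetical helper; withHookTimeout is not an existing SDK or browser-pool option.

// Sketch only: a hypothetical helper that gives an individual hook its own timeout.
function withHookTimeout(hook, timeoutSecs, name = hook.name || 'hook') {
    return async (...args) => {
        let timer;
        const timeout = new Promise((_, reject) => {
            timer = setTimeout(
                () => reject(new Error(`${name} timed out after ${timeoutSecs} seconds.`)),
                timeoutSecs * 1000,
            );
        });
        const running = Promise.resolve().then(() => hook(...args));
        try {
            return await Promise.race([running, timeout]);
        } finally {
            clearTimeout(timer); // don't let the timer fire after the hook has settled
            running.catch(() => {}); // ignore a late hook failure once the timeout has already thrown
        }
    };
}

// Usage sketch with the postLaunchHook from the actor code above:
// browserPoolOptions: {
//     postLaunchHooks: [withHookTimeout(postLaunchHook, 90)],
// }

Note that, as discussed above, the race still cannot cancel the hook itself; failing fast with a named error would only make it obvious which step exceeded its budget.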
Reproduction case for @B4nan:
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.iana.org/' });
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        navigationTimeoutSecs: 10,
        handlePageTimeoutSecs: 10,
        browserPoolOptions: {
            preLaunchHooks: [
                async () => {
                    // Do some async work that's longer than
                    // navTimeout + pageTimeout + buffer constant
                    // 10 + 10 + 10
                    await Apify.utils.sleep(40 * 1000);
                },
            ],
        },
        // This will behave similarly, but worse,
        // because launchHooks fire only once per browser,
        // whereas this will cause the issue to happen
        // request.retryCount times.
        // preNavigationHooks: [
        //     async () => {
        //         await Apify.utils.sleep(40 * 1000);
        //     },
        // ],
        handlePageFunction: async ({ request }) => {
            console.log('This should not be logged, because the request should not be running!', request.retryCount);
        },
    });
    await crawler.run();
});
Log showing that the request was reclaimed and retried, but handlePageFunction still ran, which means that any work would be done twice, leading to duplicate results and what not.
INFO System info {"apifyVersion":"2.0.7","apifyClientVersion":"1.4.2","osType":"Darwin","nodeVersion":"v16.11.1"}
INFO PuppeteerCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.iana.org/","retryCount":1,"id":"e2Hd517QWfF4tVh"}
Error: handleRequestFunction timed out after 30 seconds.
at Timeout._onTimeout (/Users/mnmkng/Projects/Apify/Actors/bug-test/node_modules/apify/build/utils.js:314:49)
at listOnTimeout (node:internal/timers:557:17)
at processTimers (node:internal/timers:500:7)
This should not be logged, because the request should not be running! 1
INFO PuppeteerCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
INFO PuppeteerCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[null,1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":9732,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9732,"requestsTotal":1,"crawlerRuntimeMillis":43190}
I'm getting a different result. It says that all requests have been processed, but the process is still going on.
Hm, that probably was the 0 concurrency bug. I'm getting the `This should not be logged` message now.
@B4nan @mnmkng Hi, I would like to ask when this fix will be released?
Describe the bug
Lately I've noticed that in 2 different actors with Puppeteer, where maxConcurrency is set to around 80-100, the dataset sometimes contains many more results than there are handled requests. For example 19.4k results but only 18.1k handled requests, which should not be possible. Some requests are only for browsing categories, and each of the remaining requests should produce exactly one result, so there should be noticeably fewer results than requests. Are we sure that some requests are not handled multiple times?
For the product detail requests, uniqueKey is set to the productId. Of the 19.4k results, only 14k are unique products (14k results sounds OK). So either requests are processed multiple times, or results were added to the dataset multiple times. I did not find any useful information in the log file.
Expected behavior
Requests should not be handled multiple times.
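For reference, the product detail requests described above are enqueued with the productId as uniqueKey, roughly like this (a sketch only; the URL shape and label are illustrative, not the actual actor code):

// Sketch only: enqueue a detail request deduplicated by productId.
await requestQueue.addRequest({
    url: `https://www.walmart.com/ip/${productId}`, // URL shape assumed
    uniqueKey: productId,
    userData: { label: 'DETAIL' }, // label assumed
});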
System information:
apify/actor-node-puppeteer-chrome:16