apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Crawler not finished / not resolving #505

Closed andyfo closed 4 years ago

andyfo commented 4 years ago

I've been testing Apify extensively recently and I've noticed a strange behavior: the crawler sometimes doesn't stop/end properly when maxRequestsPerCrawl is reached while there are still requests being processed. It just never resolves.

I put a logger into the _maybeFinish method and it goes like this:

INFO: BasicCrawler: Crawler reached the maxRequestsPerCrawl limit of 200 requests and will shut down soon. Requests that are in progress will be allowed to finish.
_maybeFinish: _currentConcurrency = 2
_maybeFinish: _currentConcurrency = 1

And then silence. After the maxRequestsPerCrawl INFO message, the _maybeFinish polling stops at 1 and then nothing. Execution never gets past await crawler.run();.
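
For reference, a minimal setup matching this description could look like the following sketch (the handler body and the URL are placeholders; the original code was not shared in the issue):

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' }); // placeholder URL

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        maxRequestsPerCrawl: 200, // the limit from the log above
        handleRequestFunction: async ({ request }) => {
            // ... fetch and process request.url ...
        },
    });

    await crawler.run(); // never resolves when the bug triggers
    console.log('Crawler finished.'); // never reached
});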

Any idea what I could be doing wrong? 🤔

Thanks!

mnmkng commented 4 years ago

@surfshore

0.20.3-dev.2 now has graceful closing of database connections, which may help in multi-process scenarios.

cspeer commented 4 years ago

@mnmkng I tested it overnight and now everything is completely error-free! Thanks a bunch, man!

mnmkng commented 4 years ago

Thanks @cspeer. Still, let us know if you experience any "irregularities".

surfshore commented 4 years ago

@mnmkng Thanks for the new code!

Regarding the "PuppeteerCrawler CPU load increases gradually" problem (and the crawler not finishing): I simplified the code and checked it. It seems to be reproduced with the following settings:

puppeteerPoolOptions: {
    retireInstanceAfterRequestCount: 2, // if this number is small, it reproduces immediately (about 50 requests)
},
launchPuppeteerOptions: {
    stealth: true,
    stealthOptions: { 'addLanguage': false }, // <- here!
}

I wanted to override the window.navigator.languages variable, so I set Accept-Language in request.headers. Let me know if you know a better way. Version: HeadlessChrome/80.0.3987.0 (from page.browser().version()).
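
For the header part, one way to send an Accept-Language header with Puppeteer is page.setExtraHTTPHeaders inside a gotoFunction (a sketch, not necessarily what was done here; the language value is illustrative):

const gotoFunction = async ({ page, request }) => {
    // Send an explicit Accept-Language header with every request.
    await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });
    return page.goto(request.url);
};

The function would then be passed as the gotoFunction option of PuppeteerCrawler.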

"database is locked" problem. I tested again with 0.20.3-dev.2, but our cases did not solve the problem. By changing APIFY_LOCAL_STORAGE_DIR for each crawler, the problem has been resolved. Thanks @cspeer ! In this version, different QUEUE_ID seem to be managed by the same database. Do you plan to create a database file for each QUEUE_ID? e.g.

{APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/db.sqlite
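
The workaround described above, as a minimal sketch: run each crawler in its own child process with a distinct storage directory (crawler.js and the directory names below are illustrative):

const { fork } = require('child_process');

// Give each crawler process its own local storage dir, so the
// processes do not contend for a single SQLite database file.
['site-a', 'site-b'].forEach((name) => {
    fork('./crawler.js', [], {
        env: {
            ...process.env,
            APIFY_LOCAL_STORAGE_DIR: `./apify_storage_${name}`,
        },
    });
});
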
mnmkng commented 4 years ago

It seems to be reproduced with the following settings.

Interesting. So without the addLanguage: false it runs fine?

EDIT: I just now realized, from the previous logs, that your requests seem to be quite long-running. Setting retireInstanceAfterRequestCount: 2 will not kill a browser until its two requests are complete. If the concurrency keeps going up but the requests take a long time to finish, more and more browsers are being opened, and at some point they may overwhelm the machine.
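
If that is the cause, one mitigation would be to retire browsers less aggressively, for example (a sketch; the values are illustrative, not a recommendation from this thread):

puppeteerPoolOptions: {
    // Let each browser serve more requests before retirement and cap
    // concurrently open pages, so retired-but-busy instances don't pile up.
    retireInstanceAfterRequestCount: 50,
    maxOpenPagesPerInstance: 5,
},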

It would be great if you could share some simple code that reproduces it.

Do you plan to create a database file for each QUEUE_ID?

No, but we'll look into ways to improve the multi-process performance and also to easily enable each process to create its own database, which will be able to hold all the queues for that process. Multiple queues in a single file are not a problem unless multiple processes access them.

surfshore commented 4 years ago

So without the addLanguage: false it runs fine?

it's taking a long time to finish the requests, more and more browsers are being opened and at some point, it may overwhelm the machine.

Yes. When the problem occurs, Chrome processes remain. My test code is below. The problem occurs when the stealthOptions line (commented out below) is enabled.

const Apify = require('apify');

Apify.main(async () => {
    Apify.utils.log.setLevel(Apify.utils.log.LEVELS.DEBUG);
    // makeRequests() is a helper (not shown) that opens a RequestQueue
    // and enqueues the test URLs listed below.
    const requestQueue = await makeRequests();

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue: requestQueue,
        minConcurrency: 3,
        maxConcurrency: 5,
        autoscaledPoolOptions: {
            desiredConcurrency: 5,
        },
        puppeteerPoolOptions: {
            retireInstanceAfterRequestCount: 2,
        },
        launchPuppeteerOptions: {
            stealth: true,
            //stealthOptions: { 'addLanguage': false },
            headless: true,
        },
        handlePageFunction: async ({ request, page }) => {
            console.log(`Processing ${request.url} ...`);
            const body = await page.content();
            //const version = await page.browser().version();
            console.log(Buffer.byteLength(body));
        },
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed.`);
        },
    });

    await crawler.run();

    console.log('Crawler finished.');
});

EDIT: The URLs were tested on two different sites of ours, each with a simple page. In my environment, the problem occurs after about 50 requests.

https://mysites/test.html?a=1
https://mysites/test.html?a=2
...
https://mysites/test.html?a=100

EDIT:

I wanted to rewrite the window.navigator.languages variable, so I set Accept-Language to request.headers. Let me know if you know a better way.

It was my misunderstanding. I wanted to override the window.navigator.languages variable, so I overrode window.navigator.languages in a gotoFunction. However, that has no effect unless stealthOptions.addLanguage is set to false. It would be helpful if I could set it freely, if possible.

e.g.

stealthOptions: { 'addLanguage': props.languages },

Current code:

launchPuppeteerOptions: {
    stealth: true,
    stealthOptions: { 'addLanguage': false },
    ...
}

const gotoFunction = async ({ page, request }) => {
    const props = getProps();
    await page.evaluateOnNewDocument((props) => {
        // Replace navigator.languages with our own value before any page script runs.
        delete navigator.__proto__.languages;
        navigator.__defineGetter__('languages', () => props.languages);
    }, props);
...
};
mnmkng commented 4 years ago

@surfshore I created a new issue to track this problem with stealth. Closing this one, as it seems that the RequestQueue-related issues were solved, barring the multi-process usage.

jektvik commented 4 years ago

Just a quick follow-up: I tested the dev branch with SQLite yesterday and saw no issues; I got the same result as on my previous belt-and-braces filesystem-based crawl with ~46,000 pages. It was not extensive testing yet, though. Looking forward to this finding its way into master.