apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Multiple calls to enqueueLinks with Promise.all result in a crash #2322

Open janbuchar opened 9 months ago

janbuchar commented 9 months ago

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Executing the snippet below results in the following error:

Error: Lock file is already being held
    at /home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/proper-lockfile/lib/lockfile.js:68:47
    at callback (/home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/graceful-fs/polyfills.js:306:20)
    at FSReqCallback.oncomplete (node:fs:205:5)
    at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17)

Also, it's strange that the stack trace doesn't lead to any user code.

Code sample

import { CheerioCrawler } from 'crawlee';

const startUrls = ['https://warehouse-theme-metal.myshopify.com/collections'];

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks, parseWithCheerio }) => {
        const $ = await parseWithCheerio();
        console.log(`Processing ${request.loadedUrl}`);
        const urls = $('a.collection-block-item, a.pagination__next, .product-item > a')
            .toArray()
            .map((it) => it.attribs['href']);
        // Intentionally enqueues the full URL set once per URL,
        // to trigger the crash deterministically.
        await Promise.all(urls.map(() => enqueueLinks({ urls })));
    },
    maxRequestRetries: 0,
    maxConcurrency: 1,
});

await crawler.run(startUrls);

Package version

3.7.3

Node.js version

v21.6.1

Operating system

Linux

Apify platform

I have tested this on the next release

No response

Other context

No response

janbuchar commented 9 months ago

I know that the reproduction example is a very bad use of the method. I made it this way to make the code crash deterministically.

Somebi commented 5 months ago

Same here, not sure where to look for the problem... seems like a missing try/catch, an improperly handled rejection, or a missing error listener somewhere in the library code.

Jack-Lewis1 commented 4 months ago

Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

janbuchar commented 4 months ago

> Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

It calls enqueueLinks with the complete set of URLs once for each URL in the set. This is completely unnecessary and slow. The correct approach would be to call enqueueLinks once, with a selector.

Even though it's bad usage, it uncovered a bug that's worth fixing, which is why I put the sample here.
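To make the redundancy concrete, here is a toy model (a mock queue with a counter, not crawlee's API) comparing the pattern from the repro against a single batch call. In crawlee itself, the idiomatic equivalent of the single call would be one `await enqueueLinks({ selector: '...' })` in the request handler.

```javascript
// Toy model: a dedup queue that counts every enqueue attempt,
// including duplicates, to show the cost of the repro's pattern.
function makeQueue() {
  const seen = new Set();
  let attempts = 0;
  return {
    enqueue(urls) {
      for (const u of urls) {
        attempts++;
        seen.add(u); // duplicates are dropped, but still cost an attempt
      }
    },
    stats() {
      return { attempts, unique: seen.size };
    },
  };
}

const urls = Array.from({ length: 50 }, (_, i) => `https://example.com/p/${i}`);

// Pattern from the repro: one call per URL, each with the full set (O(N²) attempts).
const bad = makeQueue();
urls.forEach(() => bad.enqueue(urls));
console.log(bad.stats()); // { attempts: 2500, unique: 50 }

// Single batch call: same unique URLs, O(N) attempts.
const good = makeQueue();
good.enqueue(urls);
console.log(good.stats()); // { attempts: 50, unique: 50 }
```

Both variants end up with the same 50 unique URLs; the repro's pattern just performs 50 times the work, and in crawlee each redundant call also contends for the request queue's lock file.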

mbrys0n commented 2 months ago

Encountering the same issue on 3.11.3 as well.

Error: Lock file is already being held
crawler   |     at node_modules/proper-lockfile/lib/lockfile.js:68:47
crawler   |     at callback (node_modules/graceful-fs/polyfills.js:306:20)
crawler   |     at FSReqCallback.oncomplete (node:fs:198:5)
crawler   |     at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17) 

crawler   |   code: 'ELOCKED',
crawler   |   file: '/storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json'
janbuchar commented 2 months ago

> Encountering the same issue on 3.11.3 as well.

Hi @mbrys0n, could you please share the code that triggers the crash as well?

mbrys0n commented 2 months ago

I think this was a different cause, sorry @janbuchar, but with the same symptom. Posting here in case others encounter it:

The crash was caused by running concurrent crawler instances. We resolved it by assigning each instance a separate sessionPoolOptions: { persistStateKey }, giving them separate state files.

It seems crawler instances compete for the session pool file by default, leading to a race condition when locking the state. In our case this intermittently exceeded the ~10 second lock timeout here:

https://github.com/apify/crawlee/blob/5db533afdb70eec98b077262ecfcb8efea89978b/packages/memory-storage/src/background-handler/fs-utils.ts#L68-L85