apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Multiple calls to enqueueLinks with Promise.all result in a crash #2322

Open janbuchar opened 5 months ago

janbuchar commented 5 months ago

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Executing the snippet below results in the following error:

Error: Lock file is already being held
    at /home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/proper-lockfile/lib/lockfile.js:68:47
    at callback (/home/janbuchar/Projekty/Apify/enqueue-links-crash-repro/node_modules/graceful-fs/polyfills.js:306:20)
    at FSReqCallback.oncomplete (node:fs:205:5)
    at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17)

Also, it's strange that the stack trace doesn't lead back to user code.

Code sample

import { CheerioCrawler } from 'crawlee';

const startUrls = ['https://warehouse-theme-metal.myshopify.com/collections'];

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks, parseWithCheerio }) => {
        const $ = await parseWithCheerio();
        console.log(`Processing ${request.loadedUrl}`);
        // Collect the hrefs of the collection, pagination, and product links.
        const urls = $('a.collection-block-item, a.pagination__next, .product-item > a')
            .toArray()
            .map((it) => it.attribs['href']);
        // Intentionally bad usage: enqueueLinks is called once per collected URL,
        // each time with the complete set, to make the crash deterministic.
        await Promise.all(urls.map((url) => enqueueLinks({ urls })));
    },
    maxRequestRetries: 0,
    maxConcurrency: 1,
});

await crawler.run(startUrls);

Package version

3.7.3

Node.js version

v21.6.1

Operating system

Linux

Apify platform

I have tested this on the next release

No response

Other context

No response

janbuchar commented 5 months ago

I know that the reproduction example is a very bad use of the method. I made it this way to make the code crash deterministically.

Somebi commented 2 weeks ago

Same here, and I'm not sure where to look for the problem. It seems like a missing try/catch, an improperly handled rejection, or a missing error listener somewhere in the library code.
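
In the meantime, one way to at least surface the full rejection while reproducing is to register process-level listeners before running the crawler. This is a plain Node.js sketch, not a Crawlee API, and purely a diagnostic aid:

// Diagnostic sketch only: standard Node.js hooks to log the full error,
// since the stack trace from the crash does not point at user code.
process.on('unhandledRejection', (reason) => {
    console.error('Unhandled rejection during crawl:', reason);
});
process.on('uncaughtException', (err) => {
    console.error('Uncaught exception during crawl:', err);
});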

Jack-Lewis1 commented 1 day ago

Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

janbuchar commented 1 day ago

> Hey @janbuchar, why is this a bad use of the method? I'm running into a similar error.

It calls enqueueLinks with the complete set of URLs for each URL in the set. This is completely unnecessary and slow. The correct approach would be calling enqueueLinks once, with a selector.

Even though it's bad usage, it uncovered a bug that's worth fixing, which is why I put the sample here.
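
For illustration, a single enqueueLinks call with a selector would look roughly like this (a minimal sketch against the same demo site, reusing the selector from the repro above):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        console.log(`Processing ${request.loadedUrl}`);
        // One call: Crawlee extracts, resolves, and deduplicates the matching links itself.
        await enqueueLinks({
            selector: 'a.collection-block-item, a.pagination__next, .product-item > a',
        });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);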