apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
14.99k stars 628 forks

bug: `RequestQueue.addRequest`'s `forefront` option doesn't work #2669

Open prakashgp opened 1 week ago

prakashgp commented 1 week ago

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Add a large number of initial URLs to the request queue. Once a URL is scraped, enqueue a 2nd-level scraping request with `forefront: true`. Records are pushed to the dataset only by the 2nd-level requests, which are derived from the results of the 1st-level requests. Due to this bug, even though the 2nd-level requests are pushed to the front of the queue, they are still scraped at the end, after all of the initial requests. As a result, the scraper runs for a long time without adding any records to the dataset.

Code sample

```javascript
import { BasicCrawler, RequestQueue, Configuration } from 'crawlee'
import { Actor } from 'apify'

const requestQueue = await RequestQueue.open('CUSTOM', {
    config: new Configuration({ persistStorage: false })
})

const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler(context) {
        const { request } = context
        const { url, label } = request
        if (label === 'LEVEL2') {
            // this task is executed at the end of the scraping process,
            // even though `forefront` was set to `true`
            await Actor.pushData({ url, label })
        } else {
            // add 2nd-level requests, which should go to the front of the queue
            // <<< BUG: these always get scraped last, after all 1st-level URLs.
            // If the number of initial URLs is large, the scraper therefore
            // pushes no items to the dataset until all of them are scraped.
            await context.addRequests([{ url: 'https://googl.com/', label: 'LEVEL2' }], { forefront: true })
        }
    }
}, new Configuration({ persistStorage: false }))

await requestQueue.addRequests([
    // initial urls to scrape
])

await crawler.run()
```

Package version

^3.11.3

Node.js version

v20.13.0

Operating system

Mac OS

Apify platform

I have tested this on the next release

Error: Detected incompatible Crawlee version used by the SDK. User installed 3.11.4-beta.0 but the SDK uses 3.11.3

Other context

No response

barjin commented 5 days ago

Thank you for your report!

It indeed seems that the forefront option does not work as expected with the MemoryStorage implementation of RequestQueue.

As far as I can tell, this is caused by two different problems:

  1. The `forefront` option is encoded by a negative `orderNo` value. However, we do not take this into account in `listHead` and similar methods. This, IMO, causes the main issue of forefront requests not being handled differently from regular requests.

https://github.com/apify/crawlee/blob/2fa8a29b815689f041f3d06cc0563e77e02e05f4/packages/memory-storage/src/resource-clients/request-queue.ts#L220-L225

  2. The `RequestQueue` batching behavior - the `RequestQueue` reads requests in batches of 25 and then processes them. Once the current batch is done, it reads another batch. This works well for the append-only idea of a queue, but breaks once we add "priority" requests - those aren't added to the current batch but to the "cold" queue (actually, due to 1., they are just appended to the end of the queue).

I remember @vladfrangu did most of the work on RQv2 - are my assumptions correct here, or did I miss something? To be honest, I'm not sure how to solve this issue, really 🤷🏽 Maybe just add a special forefront store that stores the forefront-enqueued requests, aside from the actual queue? And add a performance warning to the option's documentation?
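For illustration, here is a minimal sketch of the first problem, not the actual MemoryStorage code (the helper and field names are hypothetical): if forefront requests are encoded with negative `orderNo` values, a `listHead` that sorts by `orderNo` before taking the head would surface them first, whereas one that preserves insertion order would not.

```javascript
// Hypothetical sketch: sort pending requests by orderNo so that
// forefront requests (negative orderNo) come before regular ones.
function listHead(requests, limit) {
    return [...requests]
        .filter((req) => req.orderNo !== null) // skip already-handled requests
        .sort((a, b) => a.orderNo - b.orderNo) // negative (forefront) first
        .slice(0, limit)
}

const queue = [
    { id: 'a', orderNo: 1 },
    { id: 'b', orderNo: 2 },
    { id: 'c', orderNo: -3 }, // enqueued with forefront: true
]
console.log(listHead(queue, 2).map((r) => r.id)) // → [ 'c', 'a' ]
```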

vladfrangu commented 5 days ago

Oof, nice catch. We'll either have to collect all requests, sort them, then list (which sounds super inefficient), store the requests sorted in memory (with something like insertion sort), or split forefront into its own map and go from there.
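The last option above could be sketched roughly like this (hypothetical, not the real Crawlee API): keep forefront requests in their own store and always drain it before the regular queue, so priority requests skip ahead of any batch read from the tail.

```javascript
// Hypothetical sketch of the "split forefront into its own store" idea.
class SketchQueue {
    constructor() {
        this.forefront = [] // most recently enqueued forefront request first
        this.regular = []   // append-only FIFO tail
    }
    add(request, { forefront = false } = {}) {
        if (forefront) this.forefront.unshift(request)
        else this.regular.push(request)
    }
    fetchNext() {
        // drain forefront requests before touching the regular queue
        return this.forefront.shift() ?? this.regular.shift()
    }
}

const q = new SketchQueue()
q.add('level1-a')
q.add('level1-b')
q.add('level2', { forefront: true })
console.log(q.fetchNext()) // → level2
```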

B4nan commented 5 days ago

> Tick me if you encountered this issue on the Apify platform

@prakashgp your code does not contain Actor.init() and Actor.exit() calls, so it will use the default memory storage even on the platform. If you add those, you will use the API instead of memory storage, where the forefront option works just fine.
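A minimal sketch of that workaround, assuming the reporter's crawler setup: wrapping the run in `Actor.init()` and `Actor.exit()` so the platform's API-backed storage is used, where `forefront` behaves as documented. This cannot run outside the Apify environment, so treat it as illustrative only.

```javascript
import { BasicCrawler } from 'crawlee'
import { Actor } from 'apify'

await Actor.init() // on the platform, switches storage to the Apify API

const requestQueue = await Actor.openRequestQueue('CUSTOM')

const crawler = new BasicCrawler({
    requestQueue,
    async requestHandler({ request, addRequests }) {
        if (request.label === 'LEVEL2') {
            await Actor.pushData({ url: request.url, label: request.label })
        } else {
            // against the API-backed queue, forefront works as expected
            await addRequests([{ url: 'https://example.com/', label: 'LEVEL2' }], { forefront: true })
        }
    },
})

await crawler.run()
await Actor.exit()
```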