apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.39k stars, 662 forks

handledRequestCount from requestQueue.getInfo() after restart is 0 #2465

Open Bec-k opened 5 months ago

Bec-k commented 5 months ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/cheerio (CheerioCrawler)

Issue description

Create a named queue and add 5 requests to it. Exit before the queue is depleted. Restart the app and check requestQueue.getInfo(): it reports handledRequestCount: 0, even though totalRequestCount is 5.

Code sample

No response

Package version

3.9.2

Node.js version

v22.1.0

Operating system

Ubuntu

Apify platform

I have tested this on the next release

3.9.3-beta.42

Other context

There are also problems with the actual queue info:

queueInfo: {
  accessedAt: 2024-05-14T15:33:01.889Z,
  createdAt: 2024-05-14T15:32:59.541Z,
  hadMultipleClients: false,
  handledRequestCount: 1,
  id: 'c7f9d136-652e-48e1-aa44-1519a159f2c7',
  modifiedAt: 2024-05-14T15:33:01.871Z,
  name: 'foobar',
  pendingRequestCount: 0,
  stats: {},
  totalRequestCount: 2,
  userId: '1'
}

handledRequestCount: 1 is incorrect. I fetched queueInfo after await crawler.run(), and with totalRequestCount: 2 and pendingRequestCount: 0 both requests should count as handled.
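The inconsistency in the dump above can be checked mechanically. The following is a minimal sketch, not part of Crawlee: the helper name `findCounterMismatch` is hypothetical, and the rule it checks (handled + pending should equal total) is an assumption about how the counters are meant to relate, not documented behavior.

```typescript
// Subset of the fields printed in the queueInfo dump above.
interface QueueCounters {
  totalRequestCount: number;
  handledRequestCount: number;
  pendingRequestCount: number;
}

// Assumed invariant (not confirmed by the library): every request is either
// handled or still pending, so handled + pending should equal total.
// Returns a description of the mismatch, or null if the counters add up.
function findCounterMismatch(info: QueueCounters): string | null {
  const accounted = info.handledRequestCount + info.pendingRequestCount;
  if (accounted === info.totalRequestCount) {
    return null;
  }
  return (
    `handled (${info.handledRequestCount}) + pending (${info.pendingRequestCount})` +
    ` = ${accounted}, but total is ${info.totalRequestCount}`
  );
}

// Values copied from the queueInfo dump in this report.
const reported: QueueCounters = {
  totalRequestCount: 2,
  handledRequestCount: 1,
  pendingRequestCount: 0,
};

const mismatch = findCounterMismatch(reported);
console.log(mismatch ?? "counters are consistent");
```

In a real run the object passed in would come from `await requestQueue.getInfo()` after `await crawler.run()`.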

Bec-k commented 5 months ago

I guess this counter represents only the requests handled during the current run, not all requests handled over the lifetime of the queue. You probably need another counter for that, e.g. totalHandledRequestCount, lifetimeHandledRequestCount, or overallHandledRequestCount.
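To make the guess concrete, here is a toy model of the difference between a per-run counter and a lifetime counter. It is purely illustrative and not how Crawlee is implemented: the class `QueueCountersModel`, the `PersistedQueueState` shape, and the in-memory `storage` object standing in for on-disk queue state are all assumptions. Only the field name `lifetimeHandledRequestCount` comes from the suggestion above.

```typescript
// Hypothetical state that survives a restart (stands in for on-disk storage).
interface PersistedQueueState {
  totalRequestCount: number;
  lifetimeHandledRequestCount: number;
}

// Toy model: one instance corresponds to one process run.
class QueueCountersModel {
  // Per-run counter: starts at 0 every time the process starts.
  private handledThisRun = 0;
  private persisted: PersistedQueueState;

  constructor(persisted: PersistedQueueState) {
    this.persisted = persisted;
  }

  // Record one handled request in both the per-run and the lifetime counter.
  markHandled(): void {
    this.handledThisRun += 1;
    this.persisted.lifetimeHandledRequestCount += 1;
  }

  getInfo() {
    return {
      totalRequestCount: this.persisted.totalRequestCount,
      // Mirrors the behavior guessed at in this thread: current run only.
      handledRequestCount: this.handledThisRun,
      // The proposed additional counter: survives restarts.
      lifetimeHandledRequestCount: this.persisted.lifetimeHandledRequestCount,
    };
  }
}

// First run: 5 requests queued, 2 handled, then the process exits.
const storage: PersistedQueueState = { totalRequestCount: 5, lifetimeHandledRequestCount: 0 };
const firstRun = new QueueCountersModel(storage);
firstRun.markHandled();
firstRun.markHandled();

// Second run: a new instance over the same persisted state.
const secondRun = new QueueCountersModel(storage);
const afterRestart = secondRun.getInfo();
console.log(afterRestart);
```

In this model the per-run counter reads 0 right after the restart while the lifetime counter still reads 2, which is the distinction the proposed extra field would expose.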