apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.35k stars 659 forks

The status message crawling state doesn't persist abort #1855

Closed HonzaKirchner closed 1 year ago

HonzaKirchner commented 1 year ago

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

The crawling progress shown in the status message doesn't persist across an abort (Crawled 1973/133 pages). Unfortunately, there is nothing we can do if the user does an immediate abort, but we can pull the request queue state at the start to sync the counters. The message is displayed quite prominently, so we should probably take this extra step.

Motivation

The issue was reported here

Ideal solution or implementation, and any additional constraints

🤔

Alternative solutions or implementations

No response

Other context

No response

barjin commented 1 year ago

Huh, that's a good point - this might be the same issue we encounter when a migration happens. Perhaps we can store the total number of enqueued links in the crawler.stats? This should fix the inconsistencies (the number of processed requests is also loaded from the stats), should be durable enough in case of migration / graceful abort, and seems to me like the path of least resistance for this feature right now.
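A minimal sketch of the idea above, assuming the enqueued total is kept next to the finished count and persisted with it. Everything here is hypothetical and for illustration only: `InMemoryStore` stands in for whatever persistent storage backs the stats, and `ProgressStats` with its `requestsEnqueued` field is not Crawlee's actual statistics class.

```typescript
// Hypothetical stand-in for a persistent key-value store (not Crawlee's API).
class InMemoryStore {
    private readonly data = new Map<string, string>();

    setValue(key: string, value: unknown): void {
        this.data.set(key, JSON.stringify(value));
    }

    getValue<T>(key: string): T | undefined {
        const raw = this.data.get(key);
        return raw === undefined ? undefined : (JSON.parse(raw) as T);
    }
}

interface ProgressState {
    requestsFinished: number;
    requestsEnqueued: number; // hypothetical field proposed in the comment above
}

// Hypothetical stats holder: both counters are saved and restored together,
// so the status message stays consistent after a migration or graceful abort.
class ProgressStats {
    state: ProgressState = { requestsFinished: 0, requestsEnqueued: 0 };
    private readonly store: InMemoryStore;
    private readonly key: string;

    constructor(store: InMemoryStore, key = 'PROGRESS_STATS') {
        this.store = store;
        this.key = key;
    }

    persist(): void {
        this.store.setValue(this.key, this.state);
    }

    restore(): void {
        const saved = this.store.getValue<ProgressState>(this.key);
        if (saved) this.state = { ...saved };
    }

    statusMessage(): string {
        return `Crawled ${this.state.requestsFinished}/${this.state.requestsEnqueued} pages`;
    }
}

// First run: some progress is made, then the state is persisted.
const store = new InMemoryStore();
const firstRun = new ProgressStats(store);
firstRun.state.requestsEnqueued = 133; // illustrative numbers
firstRun.state.requestsFinished = 100;
firstRun.persist();

// Resumed run: both counters come back from the same persisted state.
const resumedRun = new ProgressStats(store);
resumedRun.restore();
console.log(resumedRun.statusMessage()); // Crawled 100/133 pages
```

This only helps when there is a chance to persist before the process ends, which matches the "migration / graceful abort" caveat in the comment.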

@B4nan am I missing something in the bigger picture (e.g. is there a major way to enqueue links without enqueueLinks)?

B4nan commented 1 year ago

enqueueLinks uses RequestQueue.addRequests, which is also used in crawler.addRequests, so fixes should ideally be implemented at the RequestQueue level to cover all the code paths. Otherwise sounds good to me.
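To illustrate why counting at the queue level covers every path, here is a rough sketch with a toy in-memory queue. `CountingQueue`, `enqueueLinksSketch` and `crawlerAddRequestsSketch` are hypothetical names, and the return shape of `addRequests` here is simplified; it is not the real `RequestQueue.addRequests` signature.

```typescript
interface RequestLike {
    url: string;
    uniqueKey?: string;
}

// Hypothetical in-memory queue: the counter lives in addRequests itself,
// so any caller that funnels through it is counted exactly once per new request.
class CountingQueue {
    private readonly seen = new Set<string>();
    totalEnqueued = 0;

    addRequests(requests: RequestLike[]): { added: number; alreadyPresent: number } {
        let added = 0;
        for (const request of requests) {
            const key = request.uniqueKey ?? request.url;
            if (this.seen.has(key)) continue; // duplicates must not inflate the total
            this.seen.add(key);
            added += 1;
        }
        this.totalEnqueued += added;
        return { added, alreadyPresent: requests.length - added };
    }
}

// Two hypothetical entry points that both end up in the same queue method.
function enqueueLinksSketch(queue: CountingQueue, urls: string[]) {
    return queue.addRequests(urls.map((url) => ({ url })));
}

function crawlerAddRequestsSketch(queue: CountingQueue, requests: RequestLike[]) {
    return queue.addRequests(requests);
}

const queue = new CountingQueue();
enqueueLinksSketch(queue, ['https://example.com/a', 'https://example.com/b']);
crawlerAddRequestsSketch(queue, [
    { url: 'https://example.com/b' }, // already present, not counted again
    { url: 'https://example.com/c' },
]);
console.log(queue.totalEnqueued); // 3
```

Counting inside the helpers instead would mean repeating the deduplication logic in each of them, which is the inconsistency the comment is trying to avoid.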

B4nan commented 1 year ago

Additional discussion here: https://apifier.slack.com/archives/C0L33UM7Z/p1696325864681189

Lukáš K.: If the user aborts without the "graceful" option, there is no way to correctly persist the state :confused: But we can load the info from the queue at the start of the actor.
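A possible shape for the "load the info from the queue at the start" step, as a sketch only. It assumes the queue can report total and handled request counts; the field names `totalRequestCount` and `handledRequestCount` are modelled on what the request queue info is expected to expose and should be checked against the real API. `QueueLike`, `progressFromInfo` and `loadProgressFromQueue` are hypothetical names.

```typescript
// Assumed shape of the queue info; verify the field names against the real API.
interface QueueInfoLike {
    totalRequestCount: number;
    handledRequestCount: number;
}

// Hypothetical minimal queue interface used by this sketch.
interface QueueLike {
    getInfo(): Promise<QueueInfoLike | undefined>;
}

interface Progress {
    finished: number;
    total: number;
}

// Pure helper: derive the counters from whatever the queue reports,
// falling back to zero for a brand-new queue with no info yet.
function progressFromInfo(info: QueueInfoLike | undefined): Progress {
    return {
        finished: info?.handledRequestCount ?? 0,
        total: info?.totalRequestCount ?? 0,
    };
}

// Called once at startup, before the first status message is produced.
async function loadProgressFromQueue(queue: QueueLike): Promise<Progress> {
    return progressFromInfo(await queue.getInfo());
}

function formatStatus(progress: Progress): string {
    return `Crawled ${progress.finished}/${progress.total} pages`;
}

// Usage with a fake queue that already holds state from an aborted run
// (numbers are illustrative).
const resumedQueue: QueueLike = {
    getInfo: async () => ({ totalRequestCount: 133, handledRequestCount: 120 }),
};

loadProgressFromQueue(resumedQueue).then((progress) => {
    console.log(formatStatus(progress)); // Crawled 120/133 pages
});
```

Because the queue is the source of truth here, this works even after an immediate abort, where nothing in memory could be persisted.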