
RequestQueue.getRequest() should use local cache #297

Open jancurn opened 5 years ago

jancurn commented 5 years ago

This shouldn't cause any problems and could greatly improve performance. See the TODO at https://github.com/apifytech/apify-js/blob/master/src/request_queue.js#L276
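For illustration, a minimal sketch of what such a cache could look like (hypothetical internals, not the actual code in src/request_queue.js; the shape of the underlying `client` here is an assumption):

```js
// Sketch only, not the real implementation. Requests fetched from the
// remote storage are kept in an in-memory Map keyed by request ID, so
// repeated getRequest() calls for the same ID skip the API round trip.
class RequestQueue {
  constructor(client) {
    this.client = client;           // remote storage client (assumed shape)
    this.requestCache = new Map();  // requestId -> request object
  }

  async getRequest(requestId) {
    const cached = this.requestCache.get(requestId);
    if (cached) return cached; // served locally, no network call

    const request = await this.client.getRequest({ requestId });
    if (request) this.requestCache.set(requestId, request);
    return request;
  }
}
```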

jancurn commented 5 years ago

Actually, since the underlying storage is not read-after-write consistent, calling getRequest() immediately after addRequest() might return null and thus cause weird bugs. I'm flagging this as a bug, then.
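A write-through cache would remove this race as well, not just speed things up: if addRequest() stores the request in the same local cache before returning, an immediate getRequest() is served locally and never depends on the remote storage's consistency. Continuing the sketch above (still hypothetical internals):

```js
// Write-through addition to the RequestQueue sketch above: cache the
// request as part of addRequest(), so the add-then-get sequence cannot
// observe null even while the remote write is still propagating.
RequestQueue.prototype.addRequest = async function (request) {
  const { requestId, wasAlreadyPresent } = await this.client.addRequest({ request });
  this.requestCache.set(requestId, { ...request, id: requestId });
  return { requestId, wasAlreadyPresent };
};

// const { requestId } = await queue.addRequest({ url: 'https://www.example.com/' });
// const request = await queue.getRequest(requestId); // cache hit, never null
```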

jancurn commented 5 years ago

This might also be the cause of the following problem:

```
2019-02-14T11:59:46.283Z ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.example.com/","retryCount":1} (error details: type=record-not-found, statusCode=404)
2019-02-14T11:59:46.286Z   ApifyClientError: Record was not found
2019-02-14T11:59:46.288Z     at exports.newApifyClientErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:87:12)
2019-02-14T11:59:46.291Z     at exports.requestPromise (/home/myuser/node_modules/apify-client/build/utils.js:158:19)
2019-02-14T11:59:46.294Z     at <anonymous>
2019-02-14T11:59:46.296Z     at process._tickCallback (internal/process/next_tick.js:189:7)
2019-02-14T11:59:46.298Z ERROR: BasicCrawler: runTaskFunction error handler threw an exception. This places the RequestQueue into an unknown state and crawling will be terminated. This most likely happened due to RequestQueue being overloaded and unable to handle Request updates even after exponential backoff. Try limiting the concurrency of the run by using the maxConcurrency option. (error details: type=record-not-found, statusCode=404)
2019-02-14T11:59:46.300Z   ApifyClientError: Record was not found
2019-02-14T11:59:46.302Z     at exports.newApifyClientErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:87:12)
2019-02-14T11:59:46.303Z     at exports.requestPromise (/home/myuser/node_modules/apify-client/build/utils.js:158:19)
2019-02-14T11:59:46.305Z     at <anonymous>
2019-02-14T11:59:46.307Z     at process._tickCallback (internal/process/next_tick.js:189:7)
2019-02-14T11:59:46.309Z ERROR: AutoscaledPool: runTaskFunction failed. (error details: type=record-not-found, statusCode=404)
2019-02-14T11:59:46.311Z   ApifyClientError: Record was not found
2019-02-14T11:59:46.313Z     at exports.newApifyClientErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:87:12)
2019-02-14T11:59:46.315Z     at exports.requestPromise (/home/myuser/node_modules/apify-client/build/utils.js:158:19)
2019-02-14T11:59:46.317Z     at <anonymous>
2019-02-14T11:59:46.319Z     at process._tickCallback (internal/process/next_tick.js:189:7)
2019-02-14T11:59:46.382Z User function threw an exception:
2019-02-14T11:59:46.388Z ApifyClientError: Record was not found
2019-02-14T11:59:46.390Z     at exports.newApifyClientErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:87:12)
2019-02-14T11:59:46.392Z     at exports.requestPromise (/home/myuser/node_modules/apify-client/build/utils.js:158:19)
2019-02-14T11:59:46.394Z     at <anonymous>
2019-02-14T11:59:46.396Z     at process._tickCallback (internal/process/next_tick.js:189:7)
```
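For reference, the mitigation the log message itself suggests (limiting concurrency so the queue receives fewer simultaneous updates) would look roughly like this with the SDK of that era; the handler body is only illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: 'https://www.example.com/' });

  const crawler = new Apify.BasicCrawler({
    requestQueue,
    maxConcurrency: 10, // cap parallel tasks to ease pressure on the queue API
    handleRequestFunction: async ({ request }) => {
      console.log(`Processing ${request.url}`);
    },
  });

  await crawler.run();
});
```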
jancurn commented 5 years ago

Just a note that the RequestQueue should support the use case where one actor writes to the queue and another reads from it. Perhaps a cached entry should be used only if it's less than N seconds old; after that, we can just use the underlying storage.
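A sketch of that TTL rule on top of the earlier cache (hypothetical internals; the 5-second limit is only a placeholder for N):

```js
// TTL variant of the cached getRequest() sketched earlier. Entries now
// carry a timestamp and are trusted only while fresh, so requests written
// to the queue by another actor eventually become visible to this one.
const MAX_CACHED_AGE_MILLIS = 5000; // the "N seconds" above; placeholder value

RequestQueue.prototype.getRequest = async function (requestId) {
  const entry = this.requestCache.get(requestId); // { request, cachedAt }
  if (entry && Date.now() - entry.cachedAt < MAX_CACHED_AGE_MILLIS) {
    return entry.request; // fresh enough to serve locally
  }
  const request = await this.client.getRequest({ requestId });
  if (request) this.requestCache.set(requestId, { request, cachedAt: Date.now() });
  return request;
};
// (addRequest() would store { request, cachedAt: Date.now() } accordingly.)
```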