apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

SDK_SESSION_POOL_STATE growing infinitely on crawler reruns #2074

Closed barjin closed 1 year ago

barjin commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

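// The require path points at a local build inside the crawlee repository.
const { CheerioCrawler } = require('./packages/cheerio-crawler/dist/index');

const c = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(request.url);
    },
});

// Rerun the same crawler instance every second, each time with a fresh random URL.
setInterval(() => {
    c.run([`https://jindrich.bar/${Math.random().toString(36).substring(7)}`]);
}, 1000);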

Observe storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json. Each subsequent crawler run appends new lines to this file, and the file is never purged.

If the script runs for long enough, this could cause a memory leak (as in #2031).
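
One way to make the growth visible is to log the file size while the repro script runs. This is only a rough sketch: the path is the one quoted above, relative to the working directory, and `getFileSize` and `watchFileSize` are hypothetical helper names that use only Node.js built-ins, not anything provided by Crawlee.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Path quoted in the issue, relative to the current working directory.
const STATE_FILE = path.join('storage', 'key_value_stores', 'default', 'SDK_SESSION_POOL_STATE.json');

// Hypothetical helper: size of the file in bytes, or 0 if it does not exist yet.
function getFileSize(filePath) {
    try {
        return fs.statSync(filePath).size;
    } catch (err) {
        if (err.code === 'ENOENT') return 0;
        throw err;
    }
}

// Hypothetical helper: log the file size at a fixed interval and return the timer.
function watchFileSize(filePath, intervalMs = 5000) {
    return setInterval(() => {
        console.log(`${new Date().toISOString()} ${filePath}: ${getFileSize(filePath)} bytes`);
    }, intervalMs);
}
```

Calling `watchFileSize(STATE_FILE)` next to the `setInterval` in the repro should show the reported size increasing from run to run if the state is not being reset.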

Code sample

No response

Package version

3.5.4

Node.js version

Node.js 16, 18, 20

Operating system

Linux Mint, amd64

Apify platform

I have tested this on the next release

No response

Other context

No response

B4nan commented 1 year ago

Hmm, sounds like I forgot to reset the session pool stats here:

https://github.com/apify/crawlee/blob/master/packages/basic-crawler/src/internals/basic-crawler.ts#L776-L777

barjin commented 1 year ago

I guess there are several ways to work around this. I also think that purging the KeyValueStore doesn't work correctly: if I purge the storage manually (by changing onlyPurgeOnce to false), I get this error message:

Error: Could not find file at /home/jindrichbar/Desktop/apify/crawlee/storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json
    at KeyValueFileSystemEntry.get (/home/jindrichbar/Desktop/apify/crawlee/packages/memory-storage/dist/fs/key-value-store/fs.js:70:23)

which is probably caused by some mismatch between the in-memory KVS state and the on-disk files.

Either way, if you have a quick fix that solves this, I'm all ears :) You say to "reset the session pool stats", but how would I do that? I don't see a reset (or similar) method on the SessionPool class.

B4nan commented 1 year ago

We need to add such a method, just like we have stats.resetStore.
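
A minimal sketch of what such a reset could look like, under loud assumptions: `InMemoryStore` and `SketchSessionPool` are hypothetical stand-ins written for illustration, not Crawlee's `KeyValueStore` or `SessionPool`, and the method name `resetStore` is simply borrowed from the `stats.resetStore` mentioned above. The idea is to drop the in-memory sessions and delete the persisted record before each rerun, so the stored state cannot accumulate.

```javascript
// Hypothetical stand-in for a key-value store; NOT the Crawlee KeyValueStore API.
class InMemoryStore {
    constructor() {
        this.records = new Map();
    }

    // Storing null deletes the record (an assumption of this sketch).
    async setValue(key, value) {
        if (value === null) this.records.delete(key);
        else this.records.set(key, value);
    }

    async getValue(key) {
        return this.records.has(key) ? this.records.get(key) : null;
    }
}

// Hypothetical, heavily simplified session pool; it only models persisted state.
class SketchSessionPool {
    constructor(store, persistStateKey = 'SDK_SESSION_POOL_STATE') {
        this.store = store;
        this.persistStateKey = persistStateKey;
        this.sessions = [];
    }

    addSession(id) {
        this.sessions.push({ id });
    }

    async persistState() {
        await this.store.setValue(this.persistStateKey, { sessions: [...this.sessions] });
    }

    // The proposed reset: clear in-memory sessions and remove the persisted record.
    async resetStore() {
        this.sessions = [];
        await this.store.setValue(this.persistStateKey, null);
    }
}

// Simulate three reruns; with the reset, only the last run's session is persisted.
async function simulateReruns(withReset) {
    const store = new InMemoryStore();
    const pool = new SketchSessionPool(store);
    for (let run = 0; run < 3; run++) {
        if (withReset) await pool.resetStore();
        pool.addSession(`session_${run}`);
        await pool.persistState();
    }
    const state = await store.getValue('SDK_SESSION_POOL_STATE');
    return state.sessions.length;
}
```

In this toy model, `simulateReruns(false)` resolves to 3 and `simulateReruns(true)` resolves to 1, mirroring the difference between the reported behavior and the intended one. Where the real crawler would call such a method (for example at the start of `run()`) is left open here.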