apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.8k stars 680 forks

Performance issues with KVS holding lots of items locally #2723

Open B4nan opened 1 month ago

B4nan commented 1 month ago

1 - Extract the attached file and run npm install.
2 - Run it once with npm start.
3 - This generates a key-value store (KV) with 100k keys. Note that on this first run the store initializes almost instantly.
4 - Once it completes, run it again with npm start.
5 - Notice that it now freezes while opening the store (the log message in the code below marks the spot). Ideally, opening the store should not depend on how much data it already holds.

bottleneck-poc.zip

import { Actor } from 'apify';

await Actor.init();
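// On the second run, the openKeyValueStore() call below is where the process hangs.
console.log('Store initialisation started. It will freeze here when you run this POC a second time.');
const store = await Actor.openKeyValueStore('100k-keys');
console.log('Store initialised successfully.');

// Write 100,001 small values (keys number-0 through number-100000).
for (let i = 0; i < 100001; i++) {
    await store.setValue(`number-${i}`, '1');
    console.log(`storing ${i}`);
}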

console.log('100k KV stored');

await Actor.exit();

Originally posted by @dhrumil4u360 in https://github.com/apify/crawlee/discussions/2722#discussioncomment-11053044

vladfrangu commented 1 month ago

A very painful dupe of https://github.com/apify/crawlee/issues/2248 ... It's really starting to bite our ass more and more, because we cannot preload everything without either scanning the whole dir or making a metadata file mandatory.
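
To make the trade-off above concrete, here is a minimal sketch of the two options for discovering which keys a local store holds: scanning the directory, or reading one metadata file. It is not Crawlee's storage code. The file layout, the `__index__.json` name, and every function name are assumptions made up for illustration, and it uses plain synchronous Node.js `fs` calls in CommonJS style so it runs on its own.

```javascript
// Illustrative sketch only: NOT how Crawlee's local storage is implemented.
// The index file name and all helper names below are hypothetical.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

const INDEX_FILE = '__index__.json'; // hypothetical metadata file

// Option 1: discover keys by scanning the store directory.
// The work grows with the number of files in the directory.
function listKeysByScan(dir) {
    return fs.readdirSync(dir)
        .filter((name) => name !== INDEX_FILE)
        .map((name) => path.parse(name).name) // strip the file extension
        .sort();
}

// Option 2: read the keys from a single metadata file.
// One file read, but the index must be kept in sync on every write.
function listKeysFromIndex(dir) {
    const raw = fs.readFileSync(path.join(dir, INDEX_FILE), 'utf8');
    return JSON.parse(raw).keys.sort();
}

// Store one value as its own file and remember the key in memory.
function setValue(dir, knownKeys, key, value) {
    fs.writeFileSync(path.join(dir, `${key}.txt`), value);
    knownKeys.add(key);
}

// Persist the in-memory key set as the metadata file.
function flushIndex(dir, knownKeys) {
    const body = JSON.stringify({ keys: [...knownKeys] });
    fs.writeFileSync(path.join(dir, INDEX_FILE), body);
}

// Demo in a throwaway temp directory, with 1,000 keys to keep it quick.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'kvs-sketch-'));
const knownKeys = new Set();
for (let i = 0; i < 1000; i++) {
    setValue(dir, knownKeys, `number-${i}`, '1');
}
flushIndex(dir, knownKeys);

const scanned = listKeysByScan(dir);
const indexed = listKeysFromIndex(dir);
console.log(`scan found ${scanned.length} keys, index lists ${indexed.length} keys`);

// Clean up the temp directory.
fs.rmSync(dir, { recursive: true, force: true });
```

Both functions return the same key list. The difference is that the scan touches every directory entry, while the index costs a single file read but only stays correct if every write path updates it, which is why it would have to be mandatory.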