apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

RequestQueue - Utilization of the local storage space #383

Closed: beeirl closed this issue 5 years ago

beeirl commented 5 years ago

Hello, everybody. I am using a RequestQueue backed by local storage for my CheerioCrawler. Will the number of generated .json files in the handled folder keep growing indefinitely, or are the files deleted automatically at some point? I'm asking because I have to handle a dynamic set of URLs in a single crawl, and under certain circumstances that set can be very large.

mnmkng commented 5 years ago

Hi @Druux, yes, the local RequestQueueLocal will grow without bound. Its original purpose was to enable local development and testing, rather than to provide full-featured, production-level storage. But people seem to be using it this way, so I'd be happy to review a PR that adds automatic cleanup of handled requests.
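
Until such a PR exists, one possible workaround is to prune the handled folder by hand. The snippet below is only a rough sketch, not part of the SDK: `pruneHandledRequests` is a hypothetical helper, and the directory you pass in (for example something like `./apify_storage/request_queues/default/handled`) is an assumption about the local layout that you should verify on your machine. It keeps the newest `maxFiles` .json files and deletes the rest.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper (not an SDK API): keeps at most `maxFiles` .json files
// in `handledDir`, deleting the oldest ones by modification time.
// Returns the number of files that were deleted.
function pruneHandledRequests(handledDir, maxFiles) {
    const files = fs.readdirSync(handledDir)
        .filter((name) => name.endsWith('.json')) // ignore anything that is not a request file
        .map((name) => {
            const filePath = path.join(handledDir, name);
            return { filePath, mtimeMs: fs.statSync(filePath).mtimeMs };
        })
        .sort((a, b) => a.mtimeMs - b.mtimeMs); // oldest first

    const excess = Math.max(0, files.length - maxFiles);
    for (const { filePath } of files.slice(0, excess)) {
        fs.unlinkSync(filePath);
    }
    return excess;
}
```

Be aware that the queue presumably relies on these files to know which URLs were already handled, so a deleted request may be crawled again if the same URL is enqueued later. It is probably safest to run something like this between crawls rather than while a crawler is active.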

beeirl commented 5 years ago

@mnmkng Thanks for your feedback. Sure, I will submit a pull request as soon as I get it working.

mnmkng commented 5 years ago

Good luck then!