apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.83k stars 682 forks source link

Revisit API of storages #2674

Open janbuchar opened 2 months ago

janbuchar commented 2 months ago

There are multiple considerations:

B4nan commented 2 months ago

We may not need the global accessors (e.g., Dataset.open)

The open methods are actually the only ones we might need.

We should cleanly separate filesystem-backed and memory-only storage

I am still not sure about this, we still need a way to offload things from memory, otherwise, a long-running scape would fail at some point.

janbuchar commented 2 months ago

We should cleanly separate filesystem-backed and memory-only storage

I am still not sure about this, we still need a way to offload things from memory, otherwise, a long-running scape would fail at some point.

As long as the memory-only thing is opt-in and the default is filesystem-backed, I'd argue we don't :shrug:

B4nan commented 2 months ago

I see, so the default would be the file system one. Then I am not sure what benefits this would bring, pretty much all the problems we had were about the file system part. And that implementation would still require a lot of in-memory caching most likely, I would even say it would end up as the same as the current memory storage.

vladfrangu commented 2 months ago

I'd personally be opposed to making fs-backed default, especially for requests (as we've seen the amount of troubles that causes, especially during concurrent accesses)... But we can definitely optimize some parts further I think


Can we support multiple non-named ("default") storages?

Should work with current memory-storage, does it not? memory-storage definitely supports it to my knowledge (.s.getOrCreate() -> new UUID)

B4nan commented 2 months ago

Should work with current memory-storage, does it not? memory-storage definitely supports it to my knowledge (.s.getOrCreate() -> new UUID)

This would work out of box already if we remove the storage manager cache, its not about the storage implementations, its about the fact that you calling open() twice resolves to the same default storage.