apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.57k stars, 666 forks

Consider avoiding adding suffixes to files in MemoryStorageClient #2710

Open janbuchar opened 1 month ago

janbuchar commented 1 month ago

There is currently an automagic mechanism that stores binary files with a ".bin" extension and text files with a ".txt" extension. Is the comfort for Windows users worth the complexity?
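For illustration, the kind of mechanism described above could be sketched roughly as follows. This is a minimal, hypothetical sketch and not the actual MemoryStorageClient code: the names `suffixForContentType` and `storedFileName`, the `addSuffix` flag, and the rule that only `text/*` content types count as text are all assumptions made for the example.

```typescript
// Hypothetical sketch only; these helpers are NOT part of Crawlee's API.

// Pick a file suffix from a content type (assumption: only text/* is "text").
function suffixForContentType(contentType?: string): string {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    const mime = (contentType ?? '').split(';')[0]?.trim().toLowerCase() ?? '';
    return mime.startsWith('text/') ? '.txt' : '.bin';
}

// Build the on-disk file name for a key-value store record.
// With addSuffix = false, the key is used as-is, which is what the issue proposes.
function storedFileName(key: string, contentType?: string, addSuffix = true): string {
    return addSuffix ? `${key}${suffixForContentType(contentType)}` : key;
}

console.log(storedFileName('page', 'text/plain; charset=utf-8'));
console.log(storedFileName('image', 'image/png'));
console.log(storedFileName('image', 'image/png', false));
```

With the suffix logic switched off, the record is written under the bare key, so a reader of the storage directory sees exactly the name the caller used.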

janbuchar commented 1 month ago

This surprised us in https://github.com/apify/crawlee-python/pull/572

vladfrangu commented 1 month ago

If memory serves me right, I did this to make it similar to what the API does. That said, my memory is rusty, so take this with a pinch of salt.

B4nan commented 1 month ago

I also believe this is about behaving the same as the platform/API, and we surely want this to behave the same as the apify client.

janbuchar commented 1 month ago

Could you elaborate on what the API does? 🙂 I tried adding a key with no content type and it got stored just like that, with no extension added out of nowhere.

vladfrangu commented 1 month ago

Can you check what the platform does when downloading such a file?

janbuchar commented 1 month ago

If you mean this: [image]

it downloads the file with no suffix

vladfrangu commented 1 month ago

hmm, then it might've just been a port from old-but-still-alive local storage 😓

B4nan commented 1 month ago

Better to open this on Slack. I am not opposed to changing this, but it would be great to first understand why it was like that (and I doubt anyone from our team will know).

vladfrangu commented 1 month ago

4 years ago 🥲 [image]