denoland / std

The Deno Standard Library
https://jsr.io/@std
MIT License

streaming Zip file creation #2237

Open Touffy opened 2 years ago

Touffy commented 2 years ago

I am the developer of client-zip, a very small, pretty fast streaming Zip file generator, written in modern JavaScript and a bit of WASM. A user suggested to me — insistently — that client-zip (or something based on it) should become part of the Deno standard library, presumably alongside the existing archive/tar module. client-zip already runs very well in Deno, by the way.

Why not?

This is not rhetorical. I actually don't think it's such a good idea, and I'm posting here to explain and see what happens.

The main reason for creating Zip archives instead of the more elegant Tar+Gzip format is because Zip enjoys universal support out of the box. I think it makes more sense to do the work client-side, but maybe I'm not seeing some good use cases.

client-zip has a very different design compared to what's in the Go library and most existing libraries. Instead of instantiating an object to represent the archive and calling methods on it to add files, client-zip exposes a single function that takes an async iterable of inputs and immediately returns an output stream (which is generated lazily from that moment on).

I like my design choice very much, and I think it meshes well with core Deno code which also favors async generators — particularly the wonderful fs/walk module. But it's a big side-step from the guideline of sticking with the Go stdlib, and the interface of the existing tar module. It would look like this if you wanted to zip a directory :

import { walk } from "https://deno.land/std@0.139.0/fs/walk.ts"
import { downloadZip } from "https://unpkg.com/client-zip/index.js"

async function *readFiles() {
  for await (const entry of walk(".", { includeDirs: false }))
    yield await Deno.open(entry.path)
}

// currently this would be a Response, but we could return the ReadableStream directly
const output = downloadZip(readFiles())

client-zip is designed around streaming and therefore never looking ahead at file data. That means, in general (and particularly if you use compression, when that's implemented), if you create a Zip stream to feed an HTTP Response, you won't be able to send a Content-Length with that Response. The upside, of course, is low latency, low memory usage, and no need to write temporary files.
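To make the Content-Length point concrete, here is a hedged sketch of feeding such a stream to a Response (the stream contents are placeholder bytes, not real zip data): with a body of unknown length, no Content-Length header can be set, and the server falls back to chunked transfer (HTTP/1.1) or stream framing (HTTP/2).

```typescript
// Placeholder byte stream standing in for a lazily generated zip.
const body = new ReadableStream<Uint8Array>({
  start(controller) {
    controller.enqueue(new TextEncoder().encode("zip bytes..."));
    controller.close();
  },
});

const response = new Response(body, {
  headers: {
    "Content-Type": "application/zip",
    "Content-Disposition": 'attachment; filename="archive.zip"',
    // No Content-Length: the final size cannot be known without
    // buffering or pre-scanning every input file.
  },
});
```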

crowlKats commented 2 years ago

But it's a big side-step from the guideline of sticking with the Go stdlib, and the interface of the existing tar module.

The thing with the Go stdlib is that it's rather a reference point, not a strict rule to follow. For example, we are moving various APIs away from Deno.Reader/Writer and toward web streams. Also, regarding the tar module, there is a PR to switch it to web streams (https://github.com/denoland/deno_std/pull/1985).

I do think having this would be great. From your quick example, it seems to me that downloadZip, if it returned a ReadableStream, could instead be a TransformStream. A quick & dirty example:

async function *readFiles() {
  for await (const entry of walk(".", { includeDirs: false }))
    yield await Deno.open(entry.path)
}

const readable = readableStreamFromIterator(readFiles()).pipeThrough(new ZipCompressionStream()); // we can just pass this to Response now

We should also consider how we could make this portable with the web. Maybe a similar approach to my web tar PR?
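A toy version of that idea (the class name and entry shape are hypothetical, standing in for the suggested ZipCompressionStream, which doesn't exist yet): a TransformStream whose writable side consumes entry objects and whose readable side emits bytes, so any readable of entries can be piped through it.

```typescript
// Hypothetical entry shape: metadata plus raw bytes.
interface FileEntry {
  name: string;
  data: Uint8Array;
}

// Toy stand-in for a ZipCompressionStream: consumes FileEntry objects,
// emits bytes. A real one would write zip local headers, compressed data,
// and a central directory; this just emits the name and the raw data.
class ArchiveStream extends TransformStream<FileEntry, Uint8Array> {
  constructor() {
    super({
      transform(entry, controller) {
        controller.enqueue(new TextEncoder().encode(entry.name + "\n"));
        controller.enqueue(entry.data);
      },
    });
  }
}
```

Usage would then be `entriesReadable.pipeThrough(new ArchiveStream())`, and the resulting byte stream can be handed straight to a Response.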

bartlomieju commented 2 years ago

I agree with @crowlKats; your module would be a great addition to the standard library

Touffy commented 2 years ago

Thanks for the feedback.

Since you're rewriting Tar, I have a few suggestions:

Streams are better for optimized handling of binary data and I/O buffers, but streams of objects are just more complicated than async generators for no benefit. The input happens to be a sequence of objects (basically streams with metadata, or TarOptions in your case). In fact, walk is probably the most obvious way to get the input, and it returns an async iterable. Or you might get the input by mapping an array parsed from JSON. It's easier to represent and obtain that kind of sequence as an iterable (for many use cases, a plain array is good enough). So I'd skip the readableStreamFromIterator (in your Tar rewrite as well). Besides, web streams are going to be async iterable themselves.
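To illustrate why the shim is unnecessary: a `for await` loop already accepts plain arrays, sync iterables, and async generators alike, so an API typed as `AsyncIterable<T> | Iterable<T>` covers every input shape mentioned above (sketch with made-up names):

```typescript
// Generic consumer of the kind an archiver's input loop would be:
// `for await` handles both sync and async iterables, no conversion needed.
async function collect<T>(inputs: AsyncIterable<T> | Iterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of inputs) out.push(item);
  return out;
}

// A plain array works...
const fromArray = collect(["a.txt", "b.txt"]);

// ...and so does an async generator (e.g. something built on fs/walk).
async function* generate() {
  yield "c.txt";
}
const fromGenerator = collect(generate());
```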

Of course, the output is a single binary file, so that's still a Stream.

Speaking of TarOptions, if you really want to make Tar work in the browser, you'll have to switch from accepting file paths (those only work where you can read the filesystem) to accepting actual Files or other types with the data included, like client-zip does. Let some code outside the Tar module map the paths to Files with Deno.open or fetch or something else.

The only compatibility problem I can see for client-zip (and any similar file bundler) is that Deno.open gives you a Deno.File, not a WHATWG File. The example I've written wouldn't work out of the box with the browser-compatible client-zip. You'd have to call readableStreamFromReader to get a WHATWG Stream, put it in an object with some metadata, and now you've got something that the browser can understand as well as Deno. Or, more conveniently, wrap that logic into something that takes a file path and returns a real File (should be pretty easy to do in Deno and would definitely deserve to go into the standard library).

Touffy commented 2 years ago

I forgot to mention the stage-2 proposal for map, filter, etc. on (Async)Generators. Makes async iterables even more attractive.

Anyway, if there is indeed a general drive to make Deno modules browser-compatible, then client-zip makes more sense here than I thought. We could probably have the Zip and Tar modules share the same interface (just with different options), both browser-compatible, with a separate Deno-specific mapper for filesystem input.

However, I still don't see much reason to make Zip files on the server. For anything other than sending the file to regular end users, Tar is better. And for end users… well, if you can get their browser to do it (which is the whole point of client-zip), why waste your own server CPU?

crowlKats commented 2 years ago

However, I still don't see much reason to make Zip files on the server.

I'd say most usages will actually be unzipping rather than zipping

Touffy commented 2 years ago

I'd say most usages will actually be unzipping rather than zipping

I think so too. Unzipping is another beast entirely, though, and will never be part of client-zip.

When zipping, you can pick just one implementation (that most unzip programs can understand) and do that well. For unzipping, you need to be compatible with all the quirky Zip files generated by lots of different programs and versions since 1989.

Also, unzipping can sometimes be streamed, but you can never be sure in advance (in the case of client-zip's output, it's guaranteed not to be stream-extractable), so basically you have to store the whole Zip file somewhere with fast random access, which removes one of the good reasons for a pure JS implementation.
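To make the random-access point concrete: a zip reader must first locate the End Of Central Directory record, whose signature is the bytes `PK\x05\x06`, by scanning backwards from the end of the file, since the variable-length trailing comment means the record has no fixed offset (minimal sketch, not a full parser):

```typescript
// Find the offset of the End Of Central Directory (EOCD) record, the
// index a zip reader needs before it can list any entries. The EOCD is
// at least 22 bytes and sits at the end of the file, possibly followed
// by a comment of unknown length, so we search backwards for its
// signature bytes 0x50 0x4b 0x05 0x06 ("PK\x05\x06").
function findEOCD(zip: Uint8Array): number {
  for (let i = zip.length - 22; i >= 0; i--) {
    if (
      zip[i] === 0x50 &&
      zip[i + 1] === 0x4b &&
      zip[i + 2] === 0x05 &&
      zip[i + 3] === 0x06
    ) {
      return i;
    }
  }
  return -1; // not a valid zip file
}
```

This is exactly why streaming extraction can't be trusted in general: the authoritative file list lives at the end, so a correct reader needs the whole file with fast random access before it starts.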

Given those constraints (on top of the performance issues I already talked about for zipping), I think we're better off just calling a native unzipping utility on the Zip file right after storing it locally.