101arrowz / fflate

High performance (de)compression in an 8kB package
https://101arrowz.github.io/fflate
MIT License
2.27k stars · 79 forks

Streaming unzipping crashes #66

Closed manucorporat closed 3 years ago

manucorporat commented 3 years ago

How to reproduce

This ZIP seems to reproduce the issue: https://sethealth-customers.b-cdn.net/repo.zip

Try to parse the ZIP file as a single huge chunk (just to reproduce the issue):

import {Unzip, UnzipInflate} from 'fflate';
import {readFileSync} from 'fs';

const readFile = (file) => {
  return new Promise((resolve, reject) => {
    const chunks = [];
    file.ondata = (err, chunk, final) => {
      if (err) {
        // Pass the error itself and stop; the original passed `reject`
        // as the rejection value and fell through to the code below.
        reject(err);
        return;
      }
      if (chunk) {
        chunks.push(chunk);
      }
      if (final) {
        resolve(concatBuffers(chunks));
      }
    };
    file.start();
  });
}

const concatBuffers = (arrays) => {
  const totalLength = arrays.reduce(
    (acc, value) => acc + value.byteLength,
    0
  );
  const output = new Uint8Array(totalLength);
  let offset = 0;
  for (let array of arrays) {
    output.set(array, offset);
    offset += array.byteLength;
  }
  return output;
};

const chunk = new Uint8Array(readFileSync("repo.zip"));
const promises = [];
const unzipper = new Unzip((handler) => {
  promises.push((async () => {
    return {
      name: handler.name,
      stream: await readFile(handler),
    }
  })());
});
unzipper.register(UnzipInflate);
unzipper.push(chunk, false);
unzipper.push(new Uint8Array(), true);
const output = await Promise.all(promises);
console.log(output);

^ This code will crash with "Invalid zip data".

The problem

fflate seems to unzip this file correctly when using unzipSync(), but fails in streaming mode:

import {readFileSync} from 'fs';
import {unzipSync} from 'fflate';

const chunk = new Uint8Array(readFileSync("repo.zip"));
const output = unzipSync(chunk);
console.log(output);

101arrowz commented 3 years ago

I'll take a look. BTW you should probably try using unzip instead of unzipSync until I fix this issue because it is multithreaded and therefore usually faster.

101arrowz commented 3 years ago

@manucorporat I took a look at the example file, and unfortunately you've come across one of the shortcomings of streaming unzip in general. This ZIP file was created in a streaming manner, i.e. the length of each file is not encoded in the archive until the very end. Therefore, to see where one file ends and the next begins, fflate has to iterate through every byte in the archive and look for the magic number 0x04034B50 that signifies the start of the next file. Unfortunately, 307.dcm contains this exact byte sequence at byte 420119, so the decompressor thinks it found a new file when in reality it is still inside 307.dcm. The reason the non-streaming method works is that it has access to the entire file, so it can read the file lengths encoded at the end of the archive instead of searching for the magic number.
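For readers who want to check whether an archive was created in this streaming manner: bit 3 of the general-purpose flag field in each local file header indicates that the sizes are deferred to a trailing data descriptor, which is exactly the layout that forces the magic-number scan described above. A minimal sketch (the helper name is made up; it assumes the buffer starts at a well-formed local file header):

```javascript
// Returns true if the local file header at the start of `buf` uses
// "streaming" mode, i.e. bit 3 of the general-purpose flag is set and
// the sizes are deferred to a data descriptor after the file data.
const isStreamedZipEntry = (buf) => {
  // Local file header signature: 0x04034B50 ("PK\x03\x04"), little-endian.
  const sig = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);
  if ((sig >>> 0) !== 0x04034b50) throw new Error('not a local file header');
  // The general-purpose flag is a 2-byte little-endian field at offset 6.
  const flags = buf[6] | (buf[7] << 8);
  return (flags & 0x0008) !== 0;
};
```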

This isn't something that fflate can resolve; if you look around, you'll see that the reason most people don't offer streaming decompressors is this exact issue.

The conditions for this bug to occur are so specific that I thought it would never happen. I will document this behavior, but unfortunately I don't think there is any way to avoid this problem besides using a non-streaming API. Sorry for this bug; let me know if you have any questions.

manucorporat commented 3 years ago

Hello @101arrowz, thanks a lot for the amazing response! I'm happy to learn all these interesting details about ZIP and compression. Bummer about the streaming issue; I guess the chances of it happening are higher for big files, or a zip within a zip?

unzipSync and unzip are working great! Our code is already running within a worker, and Safari doesn't support workers within workers (which is lame), so we are forced to use the sync APIs.

We are using fflate at https://set.health/, and we love it!

Out of curiosity, do you know if .tar has a similar problem?

Would you be up for adding a second configuration argument to unzipSync() and unzip() to "filter" files (for performance reasons)?

101arrowz commented 3 years ago

i guess the chances of it happening are more likely for big files, or a zip within a zip

Pretty much, but the magic number was specifically designed to be quite rare, so it's unfortunate this is a problem.
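To put "quite rare" in rough numbers (a back-of-the-envelope estimate, assuming compressed output behaves like uniformly random bytes): a fixed 4-byte sequence appears at any given offset with probability 2^-32, so the expected number of false matches in an n-byte stream is about n / 2^32.

```javascript
// Expected number of false magic-number matches in n random bytes.
const expectedFalseMatches = (n) => n / 2 ** 32;

// For a single ~420 kB file a collision is already unlikely; across
// many multi-megabyte archives, one eventually becomes plausible.
console.log(expectedFalseMatches(420119)); // ~9.8e-5
```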

Out of curiosity, do you know if .tar has a similar problem?

TAR is a far superior format to ZIP IMO. No, it does not suffer from this issue during decompression; however, it can't be streamed during compression.
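The reason TAR decompresses cleanly as a stream is that each 512-byte header records the entry's size up front, as a NUL/space-padded octal ASCII field, so the reader never has to scan for a magic number. A minimal sketch of reading that field (the helper name is made up; it assumes a well-formed USTAR header):

```javascript
// Read the entry size from a 512-byte USTAR header: 12 bytes of
// NUL/space-padded octal ASCII starting at offset 124.
const tarEntrySize = (header) => {
  let field = '';
  for (let i = 124; i < 136; i++) {
    const c = header[i];
    if (c === 0 || c === 0x20) continue; // skip NUL and space padding
    field += String.fromCharCode(c);
  }
  return parseInt(field, 8);
};
```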

Since it seems you're able to choose between .zip and .tar, you presumably control the file creation process. In that case, note that the zip command will not yield a streamed ZIP. ZIPs created without streaming can always be decompressed in a streaming manner, but ZIPs created with streaming only usually can. So if you force the ZIP to be created in a non-streaming manner, you will never face this problem. You can use Info-ZIP to do this: zip -r myzip.zip folder/.

manucorporat commented 3 years ago

Hey! I guess I'll change how we generate the archive so it's not streamed, but it would still be problematic if someone uploads a streaming-generated .zip that has this issue...

Thanks a lot for your help! Feel free to close the issue :)