Closed manucorporat closed 3 years ago
I'll take a look. BTW you should probably try using unzip
instead of unzipSync
until I fix this issue because it is multithreaded and therefore usually faster.
@manucorporat I took a look at the example file and unfortunately you've come across one of the shortcomings of streaming unzip in general. This ZIP file was created in a streaming manner, i.e. the lengths of each file are not encoded in the archive until the very end of the archive. Therefore, to see when one file ends and the next begins, fflate
has to iterate through each byte in the file and look for a magic number 0x4034B50
that signifies the start of the next file. Unfortunately, 307.dcm
encodes this exact byte sequence at byte 420119, so the decompressor thinks it found a new file when in reality it's still in 307.dcm
. The reason the non-streaming method works is that it has access to the entire file, so it can read the file lengths encoded at the end of the archive instead of trying to find the magic number.
This isn't something that fflate
can resolve; if you look around, you'll see that the reason most people don't offer streaming decompressors is this exact issue.
The conditions for this bug to occur are so specific that I thought it would never happen. I will document this behavior, but unfortunately I don't think there is any way to avoid this problem besides using a non-streaming API. Sorry for this bug; let me know if you have any questions.
Hello @101arrowz, thanks a lot for the amazing response! I am so happy to know all this interesting details about zip and compression, bummer about the streaming issue, i guess the chances of it happening are more likely for big files, or a zip within a zip?
unzipSync
and unzip
is working great! our code is already already running within a worker, and safari doesn't support workers within workers (which is lame), so we are forced to use the sync apis.
We are using fflate at https://set.health/, and we love it!
Out of curiosity, do you know if .tar has a similar problem?
Would you be up for adding a configuration second argument to unzipSync() and unzip() to "filter" files (for performance reasons?)
i guess the chances of it happening are more likely for big files, or a zip within a zip
Pretty much, but the magic number was specifically designed to be quite rare, so it's unfortunate this is a problem.
Out of curiosity, do you know if .tar has a similar problem?
TAR is a far superior format to ZIP IMO. No, it does not suffer from this issue during decompression; however, it can't be streamed in compression.
Since it seems that you're able to choose .zip
or .tar
, it looks like you have control over the file creation process. In that case, note that the zip
command will not yield a streamed ZIP. It's important to note that ZIPs created without streaming can always be decompressed streaming, but ZIPs created with streaming can usually be decompressed streaming. So if you can force the ZIP to be created in a non-streaming manner, you will never face this problem. You can use InfoZIP to do this: zip -r myzip.zip folder/
.
Hey! i guess i change how we generate to not bee in streaming, but it would still be problematic in case someone uploads a streaming generated .zip that has this issue...
Thanks a lot of your help! feel free to close issue :)
How to reproduce This zip seems to reproduce the issue: https://sethealth-customers.b-cdn.net/repo.zip
Try to parse the ZIP file with a single huge chunk (just trying to reproducee the issue):
^ This code will crash with "Invalid zip data".
The problem
fflate seems to unzip correctly this file when using
unzipSync()
, but fails in stream mode: