brendan-duncan / archive

Dart library to encode and decode various archive and compression formats, such as Zip, Tar, GZip, ZLib, and BZip2.
MIT License

[Feature] Encoding and decoding of ZIP64 archives (e.g. handling archives of 4GB or more) #269

Closed · androidseb closed this issue 1 year ago

androidseb commented 1 year ago

Thanks for creating this great library, it's really useful! I'm facing an issue on which I'm hoping I could get some guidance.

I read in some places that the maximum size of a zip archive is 4GB, but I've seen software able to produce larger zip files (e.g. the ZipOutputStream class of the Java SDK), so I'm not too sure what to expect.

While using this library, I found myself limited to 4GB when producing a zip archive or extracting a zip archive. Beyond 4GB, my code either fails to read the content of the zip file (when reading) or produces a corrupted zip archive (when writing).

Is this limitation expected from this library, or could I be doing something wrong? Thanks in advance for the guidance!

techouse commented 1 year ago

Looks like there might be a fix for that https://github.com/brendan-duncan/archive/pull/270

androidseb commented 1 year ago

Thanks for the link, but that seems to be an unrelated issue. The problem I'm facing isn't memory usage: even with sufficient memory, my code silently "succeeds" while producing a corrupted zip archive...

BananaMasterz commented 1 year ago

This is VERY concerning. People shouldn't use this package unless they are SURE it will only be used for small files. It just fails silently, as described above, and you think everything finished fine.

    encoder.zipDirectory(Directory(dirToZip), filename: zipResult);

I tried zipping a directory with 10+GB of files and it just finished without saying anything. I thought all was fine until I went to unzip it.

androidseb commented 1 year ago

@brendan-duncan I decided to sponsor you, hoping that you'll be able to look at this issue in the next few weeks.

To be completely honest, as an individual, I'm not sure how many months I will keep this sponsor up, so I'll leave this up to you to decide whether this qualifies as a "sponsor priority" issue or not, your call.

This library you built has already helped me so much in its current state that I won't regret the sponsoring either way.

brendan-duncan commented 1 year ago

@androidseb, thank you very much for the sponsorship!

I was discussing this in the other thread opened by @BananaMasterz as well. In that case, the issue is individual files greater than 4GB being added to the zip. Is that the same issue you're having? I've tested zip archives > 4GB and didn't seem to have a problem; it's only when individual files being zipped are > 4GB. I'll have to add the zip extension that allows for that to the encoder. The standard zip format is limited to 4GB file sizes because it was written in the era of FAT32. I'll try to get it done as soon as I can. I've been dealing with a lot of family stuff recently, so I'm squeezing in time where I can.

androidseb commented 1 year ago

@brendan-duncan thanks for looking at this! The issues I had were not with large individual archive entries, but with a large zip file overall. I tried the following test scenarios:

  1. Creating a zip file with your library containing 50000 files of approximately 100KB, resulting in a zip file of around 5GB: the library seems to work fine and completes successfully. I can even open the zip file with an archive manager (Ubuntu's default), but when I try to extract it, I get an error saying the zip file is damaged. This works fine with 30000 files, resulting in an archive of around 3GB...
  2. Creating a large zip file with 50000 files of approximately 100KB from the Ubuntu file explorer: the archive is good and can be extracted elsewhere, but your library gives an error when reading it. Again, this works with a 3GB archive.
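For anyone trying to reproduce this, here is a throwaway stdlib-Python sketch that generates such a test tree (the file names, counts, and sizes are illustrative, not from the original report; random bytes compress poorly, so the resulting archive stays well above 4GB):

```python
import os

def make_test_tree(root: str, num_files: int = 50_000, size: int = 100 * 1024) -> None:
    """Create num_files files of `size` random bytes each under root.
    Random data compresses poorly, keeping the zip past the 4 GiB limit."""
    os.makedirs(root, exist_ok=True)
    for i in range(num_files):
        with open(os.path.join(root, f"file_{i:05d}.bin"), "wb") as f:
            f.write(os.urandom(size))
```

Zipping the generated directory with any encoder, then extracting with a different tool, is enough to surface the corruption described above.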

I can provide code samples next week, I'm away from my computer until then. In the meantime I hope this helps provide context.

brendan-duncan commented 1 year ago

Thanks for the examples. I had tested creating 10GB worth of 1MB files and it seemed to me like the library could encode and decode the folder fine. I'll check out if other programs have issues with the resulting archives, like you mention. I don't have Ubuntu installed currently, hopefully the issue is reproducible with 7z or other archivers.

If I don't get to it this week, it's not because I'm slacking, lots going on.

androidseb commented 1 year ago

> If I don't get to it this week, it's not because I'm slacking, lots going on.

No problem at all, no pressure here, take all the time you need, I'm grateful you're aware of this and looking at it.

> I had tested creating 10GB worth of 1MB files and it seemed to me like the library could encode and decode the folder fine

That's interesting. I'm pretty sure this didn't work on my machine. I would have assumed the following scenario would be the easiest to reproduce, but maybe there is something specific to my setup...

This is what I experienced:

If you can't easily reproduce this scenario, I would recommend waiting for next week; once I'm in front of my computer, I can put together a very straightforward code sample for you to try...

brendan-duncan commented 1 year ago

I think I have a repro now. My previous test was compressing the manufactured data too much, so the zip file was smaller than 4GB. I changed the compression level to produce a bigger zip, and it's now producing the problematic archive.

brendan-duncan commented 1 year ago

Ok, I figured it out. I need to implement Zip64 for this issue too, not just the individual files being > 4GB.

The regular zip format has a central directory at the end of the zip file, which lists all of the entries and their offsets into the file. It also has an end-of-central-directory record at the very end, which holds the offset of the start of the central directory.

These offsets are 32-bit, so if the zip file is > 4GB, any offset past 4GB can't be represented and will be bad.

Zip64 has separate offsets and size data for the files in the central directory, and a separate end of central directory, with 64-bit sizes and offsets. https://en.wikipedia.org/wiki/ZIP_(file_format)#ZIP64
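To make the offset limit concrete, here is a small stdlib-Python sketch of the 22-byte end-of-central-directory (EOCD) record; the field values are illustrative, and a real ZIP64 writer also caps the 16-bit entry counts, which this sketch skips:

```python
import struct

def make_eocd(num_entries: int, cd_size: int, cd_offset: int) -> bytes:
    """Build a classic EOCD record. Offsets past 4 GiB don't fit in the
    32-bit field; ZIP64 writers store the sentinel 0xFFFFFFFF there and
    put the real 64-bit offset in the separate ZIP64 EOCD record."""
    capped = min(cd_offset, 0xFFFFFFFF)
    return struct.pack(
        "<IHHHHIIH",
        0x06054B50,   # EOCD signature
        0,            # number of this disk
        0,            # disk where the central directory starts
        num_entries,  # central directory entries on this disk
        num_entries,  # total central directory entries
        cd_size,      # size of the central directory in bytes
        capped,       # offset of central directory start (32-bit!)
        0,            # comment length
    )

small = make_eocd(3, 190, 1024)
large = make_eocd(3, 190, 5 * 1024**3)  # directory starts past 4 GiB
```

For `large`, the stored offset is the 0xFFFFFFFF sentinel, which is exactly why a reader that doesn't understand ZIP64 sees a damaged archive.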

So the fix for both this issue and @BananaMasterz's is the same: I need to add zip64 support to the encoder. I'll squeeze in time here and there to implement it. It shouldn't be too hard, I just need the time.

brendan-duncan commented 1 year ago

An update, I'm back from being out of town and am close to finishing adding zip64 to the encoder.

brendan-duncan commented 1 year ago

I still need to do more testing, but I think I got zip64 encoding working. You can test it with the git version, I'll publish after I test some more.

It should also reduce the memory used when encoding files, because previously it had to read in the whole file to calculate the crc32 before writing it to disk; now it does that in smaller chunks.
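The chunked-CRC idea can be sketched in a few lines of stdlib Python (the function name and chunk size here are my own, not the library's API):

```python
import zlib

def crc32_of_file(path: str, chunk_size: int = 64 * 1024) -> int:
    """Compute a file's CRC32 incrementally, never holding more than
    one chunk in memory, by threading the running crc through zlib.crc32."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

The result is identical to hashing the whole file at once, so memory use becomes independent of file size.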

androidseb commented 1 year ago

That's awesome, thanks for getting to this so quickly!

I'm back at my computer today, I'll test this out and let you know if I find anything.

UPDATE: from my testing, this seems to be working, I was able to zip a 5GB archive with 50000 images of around 100KB, and then unzip it as well. I'll keep testing.

androidseb commented 1 year ago

@brendan-duncan I've been trying to look at how we could write good tests for this...

I gave it a shot by creating two tests here: https://github.com/brendan-duncan/archive/compare/main...androidseb:archive:tests/zip64

Unfortunately, the first test takes a long time to run (~5 minutes on my machine), and the second test requires a significant amount of RAM (it probably won't work well unless you have at least 32GB). I could work on improving the RAM usage, but the test execution time probably can't be improved much...

I could work on this a little more, but I'd like your thoughts on this before I dig deeper: would that help you, or do you want to avoid having long test running times?

brendan-duncan commented 1 year ago

@androidseb Thanks. Yeah, making a test is hard because it times out the test runner. I had some tests that generated directories of files, so memory use was fairly minimal, but encoding and decoding certainly take a while.

There are some opportunities for improving performance. Calculating the CRC32 of a file while zipping takes a big chunk of time. It currently reads in chunks to calculate the CRC32, as a separate step from compressing the file into the zip. I'm thinking I could combine those two steps, calculating the CRC32 while compressing the file, and save a bunch of time, but I still need to look into that.
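Combining the two passes is straightforward to sketch with Python's stdlib (raw deflate with `wbits=-15`, which is what zip entries use; the function name is illustrative):

```python
import zlib

def compress_with_crc(data_chunks):
    """Compress chunks while updating the CRC32 in the same pass,
    so the input is read only once."""
    comp = zlib.compressobj(level=6, wbits=-15)  # raw deflate, no zlib header
    crc = 0
    out = []
    for chunk in data_chunks:
        crc = zlib.crc32(chunk, crc)   # update checksum...
        out.append(comp.compress(chunk))  # ...and compress, in one pass
    out.append(comp.flush())
    return b"".join(out), crc & 0xFFFFFFFF
```

Because both the checksum and the compressor consume the same chunk, the file is read from disk once instead of twice.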

brendan-duncan commented 1 year ago

Another memory / performance challenge is that in order to reduce memory, it would need to read/write to disk more frequently to work on smaller chunks. But disk IO is really slow. So in order to reduce memory, it pays a significant performance cost, and to improve performance, it pays a memory cost.

Sometimes Dart is hard.

androidseb commented 1 year ago

@brendan-duncan OK, thanks for the context... Based on what you wrote there, I'll pause that zip64 large files testing idea for now, it doesn't seem viable at the moment - please let me know if you want me to pick this up again, I'm willing to help if I can.

brendan-duncan commented 1 year ago

Thanks. I'll get this published to a new version soon, was just hoping to manually do some more testing first to make sure I didn't break anything.

Looking through the code, there's so much that could be done better. I wrote this library 10 years ago, when the web was the primary platform for Dart.

For example, it currently compresses files to memory, which means the entire compressed file is held in memory, when it could stream the compression to disk to reduce the memory required. The zip format makes calculating the CRC32 (checksum) during that process difficult, because the CRC32 value needs to be written before the compressed file data, and calculating the CRC32 separately, as it does now, is slow. But maybe I could have it calculate the CRC32 while stream-compressing the file, and then inject the CRC32 into the right place in the file after the compressed data is written.

Lots of ideas, but work and family severely limit the time I get to play with these projects.
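The "write a placeholder, then seek back and patch" idea can be sketched in stdlib Python; this uses a deliberately simplified zip local file header (a real encoder would also handle ZIP64 extra fields, timestamps, and data descriptors):

```python
import io
import struct
import zlib

def write_local_entry(out, name: bytes, chunks) -> None:
    """Stream-compress one zip entry, writing zeros for the CRC32 and
    sizes in the local file header, then seek back and patch them."""
    header_pos = out.tell()
    # 30-byte local file header: crc32, compressed and uncompressed
    # sizes are written as 0 for now and patched after the data.
    out.write(struct.pack("<IHHHHHIIIHH",
        0x04034B50,  # local file header signature
        20, 0, 8,    # version needed, flags, method (8 = deflate)
        0, 0,        # mod time, mod date (zeroed for the sketch)
        0, 0, 0,     # crc32, compressed size, uncompressed size
        len(name), 0))
    out.write(name)
    comp = zlib.compressobj(wbits=-15)  # raw deflate, as zip entries use
    crc = usize = csize = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
        usize += len(chunk)
        data = comp.compress(chunk)
        csize += len(data)
        out.write(data)
    tail = comp.flush()
    csize += len(tail)
    out.write(tail)
    end = out.tell()
    out.seek(header_pos + 14)  # crc32 field starts 14 bytes into the header
    out.write(struct.pack("<III", crc & 0xFFFFFFFF, csize, usize))
    out.seek(end)  # restore position for the next entry
```

Only the 12 bytes holding the checksum and sizes are rewritten, so the compressed data itself never has to sit in memory.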

brendan-duncan commented 1 year ago

I published the update in 3.4.0. I'll work on optimizations another time. I'll close this issue.

androidseb commented 1 year ago

Thank you very much for your work on this!