Closed androidseb closed 1 year ago
Looks like there might be a fix for that https://github.com/brendan-duncan/archive/pull/270
Thanks for the link, but that seems to be an unrelated issue. The problem I'm facing isn't about memory usage: it's about producing corrupted archives. For example, memory is sufficient, but my code silently fails by completing "successfully" while producing a corrupted zip archive...
This is VERY concerning. People shouldn't use this package unless they are SURE it will be used for small files. This just fails silently as said above and you think it finished well.
encoder.zipDirectory(Directory(dirToZip), filename: zipResult);
I tried zipping a directory with 10+GB of files and it just finished without saying anything. I thought all was fine until I went to unzip it.
@brendan-duncan I decided to sponsor you, hoping that you'll be able to look at this issue in the next few weeks.
To be completely honest, as an individual, I'm not sure how many months I will keep this sponsor up, so I'll leave this up to you to decide whether this qualifies as a "sponsor priority" issue or not, your call.
This library you built has already helped me so much in its current state that I won't regret the sponsoring either way.
@androidseb, thank you very much for the sponsorship!
I was discussing this in the other thread from @BananaMasterz as well. In that case, the issue is individual files greater than 4GB being added to the zip. Is that the same issue you're having? I've tested zip archives > 4GB and didn't seem to have an issue; it's only when individual files being zipped are > 4GB. I'll have to add the zip extension that allows for that to the encoder. The standard zip format is limited to 4GB file sizes because it was written in the era of FAT32.

I'll try to get this done as soon as I can. I've been dealing with a lot of family stuff recently, so I'm squeezing in the time as much as I can.
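For context, the 4GB ceiling comes from the classic zip format storing each entry's sizes in 32-bit fields. The sketch below is a hypothetical pre-flight check (the function name is mine, not part of `package:archive`) that flags files too large for a non-Zip64 archive before encoding:

```dart
import 'dart:io';

// The classic zip format stores each entry's sizes in 32-bit fields,
// so any single file of 4 GiB (0xFFFFFFFF bytes) or more needs Zip64.
const int zip32MaxSize = 0xFFFFFFFF;

/// Returns the files under [dir] that cannot be represented in a
/// non-Zip64 archive. Hypothetical helper, not part of package:archive.
Future<List<File>> findOversizedFiles(Directory dir) async {
  final oversized = <File>[];
  await for (final entity in dir.list(recursive: true, followLinks: false)) {
    if (entity is File && await entity.length() >= zip32MaxSize) {
      oversized.add(entity);
    }
  }
  return oversized;
}
```

Running such a check before `zipDirectory` would at least turn the silent corruption into an explicit error on the caller's side.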
@brendan-duncan thanks for looking at this! The issue I had wasn't with large individual archive entries but with a large zip file overall. I tried the following test scenarios:
I can provide code samples next week, I'm away from my computer until then. In the meantime I hope this helps provide context.
Thanks for the examples. I had tested creating 10GB worth of 1MB files and it seemed to me like the library could encode and decode the folder fine. I'll check out if other programs have issues with the resulting archives, like you mention. I don't have Ubuntu installed currently, hopefully the issue is reproducible with 7z or other archivers.
If I don't get to it this week, it's not because I'm slacking, lots going on.
No problem at all, no pressure here, take all the time you need, I'm grateful you're aware of this and looking at it.
I had tested creating 10GB worth of 1MB files and it seemed to me like the library could encode and decode the folder fine
That's interesting. I'm pretty sure this didn't work on my machine; I would have assumed the following scenario would be the easiest to reproduce, but maybe there is something specific to my setup...
This is what I experienced:
If you can't easily reproduce this scenario, I would recommend waiting for next week, once I'm in front of my computer, I can craft a very straightforward code sample for you to try...
I think I have a repro now. My previous test was compressing the manufactured data too much, so the zip file was smaller than 4GB. I changed the compression level to produce a bigger zip, and it's now producing the problematic archive.
Ok, I figured it out. I need to implement Zip64 for this issue too, not just the individual files being > 4GB.
The regular zip format has a central directory at the end of the zip file, which is a list of all the files, and their offset into the zip file. It also has an end of central directory block at the end of the file, which has an offset to the start of the central directory.
These offsets are 32-bit, so if the zip file is > 4GB, any offset past the 4GB mark overflows and points to the wrong place.
Zip64 has separate offsets and size data for the files in the central directory, and a separate end of central directory, with 64-bit sizes and offsets. https://en.wikipedia.org/wiki/ZIP_(file_format)#ZIP64
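To illustrate the layout described above, here is a sketch (using only `dart:io` and `dart:typed_data`, independent of `package:archive`) that locates the end-of-central-directory record at the tail of a zip and reads its 32-bit central-directory offset. In a Zip64 archive this field is set to 0xFFFFFFFF and the real 64-bit offset lives in the separate Zip64 EOCD record:

```dart
import 'dart:io';
import 'dart:typed_data';

// End of Central Directory record signature: 'PK\x05\x06'.
const int eocdSignature = 0x06054b50;

/// Scans the tail of the zip at [path] for the EOCD record and returns
/// the 32-bit central-directory offset stored at byte 16 of that record,
/// or null if no EOCD is found. In a non-Zip64 archive larger than 4 GiB
/// this field cannot hold the real offset.
Future<int?> readCentralDirectoryOffset(String path) async {
  final file = File(path);
  final length = await file.length();
  // The EOCD is at least 22 bytes; a trailing comment can add up to 64 KiB.
  final start = length > 22 + 65535 ? length - 22 - 65535 : 0;
  final bytes = Uint8List.fromList(
      await file.openRead(start, length).expand((c) => c).toList());
  final data = ByteData.sublistView(bytes);
  // Search backwards so we find the last (real) EOCD signature.
  for (var i = bytes.length - 22; i >= 0; i--) {
    if (data.getUint32(i, Endian.little) == eocdSignature) {
      return data.getUint32(i + 16, Endian.little);
    }
  }
  return null;
}
```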
So the fix for both this issue and @BananaMasterz's is the same: I need to add Zip64 support to the encoder. I'll squeeze in time here and there to implement it. It shouldn't be too hard, I just need the time.
An update, I'm back from being out of town and am close to finishing adding zip64 to the encoder.
I still need to do more testing, but I think I got zip64 encoding working. You can test it with the git version, I'll publish after I test some more.
It should also reduce the memory used when encoding files too, because before it had to read in the whole file to calculate the CRC32 before writing it to disk, and now it does that in smaller chunks.
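The chunked approach can be sketched like this, assuming the `getCrc32` helper exported by `package:archive`, which accepts a running checksum as its second argument (the wrapper function itself is illustrative, not the library's internal code):

```dart
import 'dart:io';
import 'package:archive/archive.dart' show getCrc32;

/// Computes the CRC32 of [file] without loading it fully into memory,
/// by feeding fixed-size chunks into a running checksum.
Future<int> crc32OfFile(File file, {int chunkSize = 1 << 20}) async {
  var crc = 0;
  final raf = await file.open();
  try {
    while (true) {
      final chunk = await raf.read(chunkSize);
      if (chunk.isEmpty) break;
      crc = getCrc32(chunk, crc); // fold this chunk into the running crc
    }
  } finally {
    await raf.close();
  }
  return crc;
}
```

Peak memory is then bounded by `chunkSize` rather than by the size of the file being zipped.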
That's awesome, thanks for getting to this so quickly!
I'm back at my computer today, I'll test this out and let you know if I find anything.
UPDATE: from my testing, this seems to be working, I was able to zip a 5GB archive with 50000 images of around 100KB, and then unzip it as well. I'll keep testing.
@brendan-duncan I've been trying to look at how we could write good tests for this...
I gave it a shot by creating two tests here: https://github.com/brendan-duncan/archive/compare/main...androidseb:archive:tests/zip64
Unfortunately, the first test is taking a long time to run (~ 5 minutes on my machine), and the second test requires a significant amount of RAM to work (it probably won't work well unless you have at least 32GB of RAM on your machine). I could work on improving the RAM usage, but the test execution time probably can't be improved much...
I could work on this a little more, but I'd like your thoughts on this before I dig deeper: would that help you, or do you want to avoid having long test running times?
@androidseb Thanks. Yeah, making a test is hard because it times out the test runner. I had some tests that generated directories of files, so it was fairly minimal memory-wise but it certainly takes a while to encode and decode.
There are some opportunities for improving performance. Calculating the CRC32 of a file while zipping takes a big chunk of time. It currently reads the file in chunks to calculate the CRC32, as a separate step from compressing the file into the zip. I'm thinking I could combine those two steps, calculating the CRC32 while it's compressing the file, and save a bunch of time, but I still need to look into that.
Another memory / performance challenge is that in order to reduce memory, it would need to read/write to disk more frequently to work on smaller chunks. But disk IO is really slow. So in order to reduce memory, it pays a significant performance cost, and to improve performance, it pays a memory cost.
Sometimes Dart is hard.
@brendan-duncan OK, thanks for the context... Based on what you wrote there, I'll pause that zip64 large-files testing idea for now, as it doesn't seem viable at the moment. Please let me know if you want me to pick this up again; I'm willing to help if I can.
Thanks. I'll get this published to a new version soon, was just hoping to manually do some more testing first to make sure I didn't break anything.
Looking through the code, there's so much that could be done better. I did write this library 10 years ago, when Web was the primary platform for Dart. For example, it's currently compressing files to memory, which means the entire compressed file will be in memory, when I could have it stream the compression to disk to reduce the required memory.

The zip format makes calculating the CRC32 (checksum) during that process difficult because the CRC32 value needs to be written before the compressed file data, and calculating the CRC32 separately like it does now is a slow process. But maybe I could have it calculate the CRC32 while it's stream-compressing the file, and then inject the CRC32 into the right place in the file after the compressed data is written.

Lots of ideas, but work and family severely limit the time I get to play with these projects.
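The "inject the CRC32 afterwards" idea could look roughly like the sketch below. The function is hypothetical, but the offset is taken from the zip specification: in a local file header, the CRC32 field starts at byte 14.

```dart
import 'dart:io';
import 'dart:typed_data';

/// Sketch of the "write now, patch later" idea: stream the compressed
/// data to disk with a placeholder CRC32, then seek back and overwrite
/// the CRC32 field once it is known.
Future<void> patchCrc32(
    RandomAccessFile raf, int localHeaderOffset, int crc) async {
  final bytes = ByteData(4)..setUint32(0, crc, Endian.little);
  final savedPosition = await raf.position();
  // The CRC32 field sits at byte 14 of the local file header.
  await raf.setPosition(localHeaderOffset + 14);
  await raf.writeFrom(bytes.buffer.asUint8List());
  await raf.setPosition(savedPosition); // resume appending entries
}
```

This keeps the output fully streamed except for one small backwards write per entry; the alternative the spec offers (a data descriptor after the compressed data) avoids even that seek, at the cost of a slightly different header layout.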
I published the update in 3.4.0. I'll work on optimizations another time. I'll close this issue.
Thank you very much for your work on this!
Thanks for creating this great library, it's really useful! I'm facing an issue on which I'm hoping I could get some guidance.
I read in some places that the maximum size of a zip archive is 4GB, but I've seen software able to produce larger zip files (e.g. the ZipOutputStream class of the Java SDK), so I'm not too sure what to expect.
While using this library, I found myself limited to 4GB when producing a zip archive or extracting a zip archive. Beyond 4GB, my code either fails to read the content of the zip file (when reading) or produces a corrupted zip archive (when writing).
Is this limitation expected from this library, or could I be doing something wrong? Thanks in advance for the guidance!