It's slow to read zip headers

brendan-duncan / archive

Dart library to encode and decode various archive and compression formats, such as Zip, Tar, GZip, ZLib, and BZip2.

MIT License

It's slow to read zip headers #326

Open lemtea8 opened 6 months ago

lemtea8 commented 6 months ago

I want to read zip file headers so I can list what's inside a zip file, and I found that using ZipDecoder to read the whole file is time-consuming. After looking at the source code, I ended up using ZipDirectory directly, as follows:

final inputStream = InputFileStream(path);
final headers = ZipDirectory.read(inputStream).fileHeaders;

This improves the speed a little, but it is still very slow when dealing with a large number of zip files (>2000).
Each zip file took 30~40 ms on average to read the file headers, and disk usage was high during the whole process.

For comparison, I used minizip-ng via Dart FFI to do the same job, and each zip file took several hundred microseconds on average (0.5~0.6 ms).

I ran both experiments on an HDD with an ext4 filesystem.
Is there a better way to use this package, or is this a performance problem in the library? Thanks!

brendan-duncan commented 6 months ago

A native FFI implementation will always win; this library is written in Dart. The Zip format isn't really great for interpreted performance. The central directory is at the end of the file, and you have to search backwards from the end to find where it starts, which is awful for cache performance.
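To make the backward search concrete, here is a minimal sketch that reads the tail of the file in one seek and one read, then scans that buffer in memory for the End of Central Directory (EOCD) record. It uses only dart:io and is not how the archive package does it; `CentralDirectoryInfo`, `findCentralDirectory`, and `readCentralDirectoryInfo` are hypothetical names, and Zip64 archives are not handled.

```dart
import 'dart:io';
import 'dart:math';
import 'dart:typed_data';

/// Location of the central directory, as recorded in the EOCD record.
class CentralDirectoryInfo {
  final int entryCount;
  final int size;
  final int offset;
  CentralDirectoryInfo(this.entryCount, this.size, this.offset);
}

const int eocdSignature = 0x06054b50; // bytes "PK\x05\x06", little-endian
const int eocdMinSize = 22; // fixed part of the EOCD record
const int maxCommentSize = 0xFFFF; // the comment length field is 16 bits

/// Scans [tail] (the last bytes of a zip file) backwards for the EOCD record.
/// Returns null if no signature is found.
CentralDirectoryInfo? findCentralDirectory(Uint8List tail) {
  final data = ByteData.sublistView(tail);
  for (var i = tail.length - eocdMinSize; i >= 0; i--) {
    if (data.getUint32(i, Endian.little) == eocdSignature) {
      return CentralDirectoryInfo(
        data.getUint16(i + 10, Endian.little), // total number of entries
        data.getUint32(i + 12, Endian.little), // central directory size
        data.getUint32(i + 16, Endian.little), // central directory offset
      );
    }
  }
  return null;
}

/// Reads the tail of the file at [path] with a single seek and read.
CentralDirectoryInfo? readCentralDirectoryInfo(String path) {
  final file = File(path).openSync();
  try {
    final length = file.lengthSync();
    final tailSize = min(length, eocdMinSize + maxCommentSize);
    file.setPositionSync(length - tailSize);
    return findCentralDirectory(file.readSync(tailSize));
  } finally {
    file.closeSync();
  }
}
```

The tail is at most 22 + 65,535 bytes, so the disk sees one seek regardless of how long the archive comment is. A more careful version would also check that the comment length stored in the candidate record matches the bytes remaining after it, since the signature bytes can appear inside a comment.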

lemtea8 commented 6 months ago

I understand that the Dart version is going to be slower, but the difference is a bit larger than I expected.
IMHO, this is mostly about I/O, so the language shouldn't matter much.

brendan-duncan commented 6 months ago

InputFileStream is probably getting cache thrashed as ZipDirectory searches for the central directory. I'll have to look at that cache behavior and see if there's anything I can do to improve it. InputFileStream will read in X bytes from the file so it doesn't have to do so much file IO, but it's tuned for reading forward (read next X bytes), and not so much for reading backward (read previous X bytes).

You can read all the bytes and use an InputStream to see how it performs without IO being involved. I'll try to get to profiling the InputFileStream cache behavior with ZipDirectory as soon as I can but I've been swamped with work (as usual) so it might take a bit.
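The in-memory variant suggested here might look like the sketch below. It assumes the same archive version as the snippet in the first post (where `ZipDirectory.read` and `InputStream` are available); `listEntriesInMemory` is a hypothetical helper name.

```dart
import 'dart:io';

import 'package:archive/archive_io.dart';

/// Lists entry names after loading the whole file into memory,
/// so that header parsing itself involves no file IO.
List<String> listEntriesInMemory(String path) {
  final bytes = File(path).readAsBytesSync();
  final directory = ZipDirectory.read(InputStream(bytes));
  return [for (final header in directory.fileHeaders) header.filename];
}
```

This trades the backward seeks for one full sequential read, so it only isolates the parsing cost; for large archives the read itself can dominate.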

lemtea8 commented 6 months ago

Here are the test results for reading the headers of 3,000 zip files:

Method               Reading headers (avg)   Total
Native (minizip-ng)  <1 ms                   4.8 s
InputStream          <1 ms                   146 s (including readAsBytesSync)
InputFileStream      46 ms                   159 s
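Since the InputStream total includes readAsBytesSync, it can help to time the file read and the header parsing separately. The following is a rough sketch using only dart:io and Stopwatch; `PhaseTimings` and `timePhases` are hypothetical names, and the parsing step is passed in as a callback so the harness does not depend on any particular archive API.

```dart
import 'dart:io';
import 'dart:typed_data';

/// Accumulated wall-clock time for the two phases.
class PhaseTimings {
  Duration read = Duration.zero;
  Duration parse = Duration.zero;
  int files = 0;
}

/// Times readAsBytesSync and [parse] separately for every path.
PhaseTimings timePhases(
  Iterable<String> paths,
  void Function(Uint8List bytes) parse,
) {
  final timings = PhaseTimings();
  final watch = Stopwatch();
  for (final path in paths) {
    // Phase 1: pull the whole file into memory.
    watch
      ..reset()
      ..start();
    final bytes = File(path).readAsBytesSync();
    timings.read += watch.elapsed;

    // Phase 2: parse headers from memory (reset keeps the watch running).
    watch.reset();
    parse(bytes);
    timings.parse += watch.elapsed;
    watch.stop();

    timings.files++;
  }
  return timings;
}
```

With the archive version used earlier in the thread, the callback could be something like `(bytes) => ZipDirectory.read(InputStream(bytes))`. Repeated runs over the same files may be served from the OS page cache, so the first pass and later passes can differ considerably.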

I also tested InputFileStream with different buffer sizes (128 B, 1 KiB, 16 KiB, 256 KiB, 1 MiB, 4 MiB), but the results are more or less the same.

> I'll try to get to profiling the InputFileStream cache behavior with ZipDirectory as soon as I can but I've been swamped with work (as usual) so it might take a bit.

Thanks!