dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

System.IO.Compression ZipArchive bad performance over network #31460

Open KrabatTilt opened 5 years ago

KrabatTilt commented 5 years ago

I am working with a file format that contains a ZipArchive inside a Package. The data is accessed with the following code:

var outer = Package.Open(filename, FileMode.Open, FileAccess.Read, FileShare.Read);
var data = outer.GetPart(new Uri("/Image.data", UriKind.Relative));
var inner = new ZipArchive(data.GetStream(FileMode.Open, FileAccess.Read), ZipArchiveMode.Read, true);

Running that code targeting .NET Framework 4.7.2 takes about 0.5 sec to open a 400 MB file on an SMB network share over a 16 Mbit network connection. Memory consumption is 42 MB.

Running the same code targeting .NET Core 3.0 on the same file took 290 sec and ended up with 800 MB of memory consumption.

This is a huge regression in access time and memory consumption. Any suggestions as to where this comes from and how to work around it?

stephentoub commented 5 years ago

cc: @ericstj

ericstj commented 5 years ago

First off: can you change that format? Packages are ZIPs and storing a ZIP inside a ZIP isn't the best for performance. You aren't gaining anything from compressing twice and the nested zip will require reading more than necessary to extract its contents (additional zip overhead and seeks required to read this).

On .NETFramework the Package APIs had a different ZIP implementation that would buffer more (including buffer to temporary file on disk). This would have happened in this case since the ZipArchive API will end up doing random access on that PackagePart. In addition you end up hitting a codepath in ZipArchive where it copies the entire backing stream: https://github.com/dotnet/corefx/blob/bc115700c3ece60acd6b8dbe4b0bdb8f6f80c756/src/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L147. This wouldn't be hit on .NET Framework since the Package APIs would have buffered the part to a file behind the scenes for you. I discussed this a bit here: https://github.com/dotnet/corefx/issues/11669#issuecomment-468027597

To mimic the .NETFramework behavior, try extracting the Part to a temporary file and then opening the ZipArchive over that stream. That should give you similar memory characteristics and hopefully similar performance.
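The temporary-file workaround suggested above could be sketched roughly as follows. This is a minimal sketch, not a tested fix: `NestedZip` and `OpenViaTempFile` are hypothetical names, and it assumes the System.IO.Packaging package is referenced. Note that the part is still transferred once in full while it is copied.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.IO.Packaging;

// Hypothetical helper for illustration; not part of any .NET library.
static class NestedZip
{
    // Copies one package part to a local temp file and opens a ZipArchive
    // over that file, so ZipArchive gets a seekable stream to work with.
    public static ZipArchive OpenViaTempFile(string packagePath, string partUri)
    {
        // DeleteOnClose removes the temp file when the stream is disposed.
        var temp = new FileStream(
            Path.GetTempFileName(), FileMode.Create, FileAccess.ReadWrite,
            FileShare.None, 81920, FileOptions.DeleteOnClose);
        try
        {
            using (Package outer = Package.Open(packagePath, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (Stream part = outer.GetPart(new Uri(partUri, UriKind.Relative))
                                      .GetStream(FileMode.Open, FileAccess.Read))
            {
                part.CopyTo(temp); // sequential read of the part
            }

            temp.Position = 0;
            // leaveOpen: false, so disposing the archive also disposes
            // (and thereby deletes) the temp file.
            return new ZipArchive(temp, ZipArchiveMode.Read, leaveOpen: false);
        }
        catch
        {
            temp.Dispose();
            throw;
        }
    }
}
```

Usage would look like `using (var inner = NestedZip.OpenViaTempFile(filename, "/Image.data")) { ... }`, with the package closed again as soon as the copy is done.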

KrabatTilt commented 4 years ago

Sorry for the very late answer.

> can you change that format?

Not really, as it is a legacy format with a lot of files already in circulation.

> Packages are ZIPs and storing a ZIP inside a ZIP isn't the best for performance. You aren't gaining anything from compressing twice and the nested zip will require reading more than necessary to extract its contents (additional zip overhead and seeks required to read this).

The thing is that no compression is used at all. Both the outer and the inner archive are used purely as containers and are generated using CompressionLevel.NoCompression. The inner archive holds thousands of small entries and can be seen as a read-only container. The outer archive holds entries containing meta information about the entries of the inner archive.

When no compression is used at all, it is possible to randomly access all data in the nested archive by reading directly from the underlying FileStream (which is seekable), without a DeflateStream in between. That is how I solved this special case for now, but I had to implement my own custom ZipReader.
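For readers wondering what such a reader might involve, here is a rough sketch of the idea, not the code actually used. `StoredZipReader` and its members are hypothetical names, and the sketch assumes a seekable stream, no ZIP64, no encryption, UTF-8 (or ASCII) entry names, and entries written with the "stored" method (0). It reads the central directory once and seeks straight to an entry's bytes on demand.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not an API of System.IO.Compression.
static class StoredZipReader
{
    // Maps entry name -> (local header offset, data length) for every stored
    // entry of the ZIP occupying [baseOffset, baseOffset + length) in s.
    public static Dictionary<string, (long HeaderOffset, long Length)> Index(
        Stream s, long baseOffset, long length)
    {
        // The end-of-central-directory record lies in the last 22..65557 bytes.
        int tailLen = (int)Math.Min(length, 22 + 65535);
        byte[] tail = ReadAt(s, baseOffset + length - tailLen, tailLen);
        int eocd = tailLen - 22;
        while (eocd >= 0 && U32(tail, eocd) != 0x06054b50) eocd--;
        if (eocd < 0) throw new InvalidDataException("EOCD record not found.");

        int count = U16(tail, eocd + 10);        // total number of entries
        int cdSize = (int)U32(tail, eocd + 12);  // central directory size
        long cdOffset = U32(tail, eocd + 16);    // relative to archive start
        byte[] cd = ReadAt(s, baseOffset + cdOffset, cdSize);

        var index = new Dictionary<string, (long HeaderOffset, long Length)>();
        for (int p = 0, n = 0; n < count; n++)
        {
            if (U32(cd, p) != 0x02014b50)
                throw new InvalidDataException("Bad central directory header.");
            int method = U16(cd, p + 10);
            long size = U32(cd, p + 20);         // compressed size == raw size when stored
            int nameLen = U16(cd, p + 28), extraLen = U16(cd, p + 30), commentLen = U16(cd, p + 32);
            long headerOffset = baseOffset + U32(cd, p + 42);
            string name = Encoding.UTF8.GetString(cd, p + 46, nameLen);
            if (method == 0) index[name] = (headerOffset, size); // skip non-stored entries
            p += 46 + nameLen + extraLen + commentLen;
        }
        return index;
    }

    // Absolute offset of an entry's data, found by skipping its local header.
    public static long DataOffset(Stream s, long headerOffset)
    {
        byte[] h = ReadAt(s, headerOffset, 30);
        if (U32(h, 0) != 0x04034b50)
            throw new InvalidDataException("Bad local file header.");
        return headerOffset + 30 + U16(h, 26) + U16(h, 28);
    }

    // Reads one stored entry without touching any other part of the file.
    public static byte[] Read(Stream s, (long HeaderOffset, long Length) entry) =>
        ReadAt(s, DataOffset(s, entry.HeaderOffset), checked((int)entry.Length));

    static byte[] ReadAt(Stream s, long offset, int count)
    {
        var buffer = new byte[count];
        s.Position = offset;
        for (int done = 0; done < count;)
        {
            int read = s.Read(buffer, done, count - done);
            if (read == 0) throw new EndOfStreamException();
            done += read;
        }
        return buffer;
    }

    static int U16(byte[] b, int i) => BinaryPrimitives.ReadUInt16LittleEndian(b.AsSpan(i));
    static uint U32(byte[] b, int i) => BinaryPrimitives.ReadUInt32LittleEndian(b.AsSpan(i));
}
```

For the nested case, one would presumably call `Index` once on the whole file, take the outer entry's `DataOffset` and `Length` as the `baseOffset` and `length` of the inner archive, and call `Index` again. Only the two central directories and the requested entries are then read over the network.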