dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

ZipArchive.Open will read non-seekable stream into memory & fail with a bad error on large streams #59027

Open · ayende opened 2 years ago

ayende commented 2 years ago

Description

NetworkStream stream = await GetStreamFromUrl(remoteUrlFor_LARGE_file);
ZipArchive.Open(stream, ZipArchiveMode.Read); // <-- will fail here with `Stream was too long.`

Analysis

Reason for this is here: https://github.com/dotnet/runtime/blob/00c38c7da82cff64932396897e057cc20bdbe743/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L144

When we pass a non-seekable stream to ZipArchive, it reads the entire stream into a MemoryStream. That only works if the data fits in a MemoryStream. I assume this is because the Zip format requires seeking (the central directory at the end, etc.).

However, this is very surprising. It causes performance issues, because the whole contents are loaded into memory, and unexpected failures once the data no longer fits (the MemoryStream limit is about 2 GB), which is easy to miss when most inputs are smaller than that.

I'm not sure whether there is a way to fix this, given backward compatibility concerns.
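
For context, here is a minimal sketch of what the linked constructor path effectively does today for a stream that cannot seek. This is a paraphrase, not the runtime's actual code, but it shows where the `Stream was too long.` error comes from:

// Sketch only: ZipArchive buffers a non-seekable input before parsing anything.
Stream backingStream;
if (stream.CanSeek)
{
    backingStream = stream;
}
else
{
    // MemoryStream is capped at int.MaxValue bytes (~2 GB), so copying a larger
    // input throws IOException: "Stream was too long."
    var buffered = new MemoryStream();
    stream.CopyTo(buffered);
    buffered.Seek(0, SeekOrigin.Begin);
    backingStream = buffered;
}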

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io-compression. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: ayende
Assignees: -
Labels: `area-System.IO.Compression`, `tenet-performance`, `untriaged`
Milestone: -

adamsitnik commented 2 years ago

@carlossanlop this is something that we should consider for the .NET 7 Compression work as it's related to supporting large files (> 2GB)

jnm2 commented 2 years ago

This is particularly bad when the stream being passed to the ZipArchive constructor is a 10-15 minute download and you want to be processing the contents as they become available.

Please consider ZipArchive.OpenAsync or zipArchive.GetEntriesAsync to avoid blocking on I/O, too.

The first thing ZipArchive currently does with a seekable stream is Seek(-18, SeekOrigin.End). I'm hoping that can be avoided for non-seekable streams. I'm rather worried that there's no help for it given that .zip files place the directory at the end of the stream. https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure mentions that scanning for file entry headers isn't necessarily valid because a file may have been deleted from the directory. However, I would guess the vast majority of .zip files are built once and never deleted from, and thus would be packed with no unused space around the file entries.

https://github.com/dotnet/runtime/blob/78593b9e095f974305b2033b465455e458e30267/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L520-L522
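
For background on why that seek exists: the ZIP "end of central directory" (EOCD) record sits at the very end of the archive and is at least 22 bytes, 18 of which follow the 4-byte signature, which appears to be what the -18 corresponds to when there is no trailing archive comment. A rough sketch (not the runtime's code) of locating it on a seekable stream, which is exactly what a forward-only stream cannot do without buffering everything:

static long FindEndOfCentralDirectory(Stream s)
{
    const uint EocdSignature = 0x06054b50; // "PK\x05\x06"
    const int EocdMinimumSize = 22;        // 18 bytes after the 4-byte signature

    // A trailing archive comment can push the record up to ~64 KB from the end.
    var buffer = new byte[EocdMinimumSize + ushort.MaxValue];
    long readFrom = Math.Max(0, s.Length - buffer.Length);
    s.Seek(readFrom, SeekOrigin.Begin);
    int read = s.Read(buffer, 0, buffer.Length); // a robust version would loop on Read

    for (int i = read - EocdMinimumSize; i >= 0; i--)
    {
        if (BitConverter.ToUInt32(buffer, i) == EocdSignature) // ZIP fields are little-endian
            return readFrom + i; // absolute offset of the EOCD record
    }

    throw new InvalidDataException("End of central directory record not found.");
}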

jnm2 commented 2 years ago

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT mentions streaming .zip files and says that every local file header must have an entry in the central directory. So that takes care of the deleted files concern and shows that streaming is a known use case.

4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.

4.3.5 File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
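
For reference, the fixed portion of a local file header that a forward-only reader has to parse is 30 bytes (APPNOTE 4.3.7). A minimal sketch of reading it, assuming `stream` is positioned at a header and that names are UTF-8 (strictly, they are CP437 unless bit 11 of the flags is set):

// Sketch only: the fixed 30-byte local file header (APPNOTE 4.3.7).
// If bit 3 of `flags` is set, crc32 and the sizes are zero here and the real
// values follow the file data in a "data descriptor" (4.3.9), which is the
// awkward case for forward-only streaming.
using var reader = new BinaryReader(stream, System.Text.Encoding.UTF8, leaveOpen: true);
uint signature = reader.ReadUInt32();           // 0x04034b50, "PK\x03\x04"
ushort versionNeeded = reader.ReadUInt16();
ushort flags = reader.ReadUInt16();
ushort compressionMethod = reader.ReadUInt16(); // 0 = stored, 8 = deflate
ushort lastModTime = reader.ReadUInt16();
ushort lastModDate = reader.ReadUInt16();
uint crc32 = reader.ReadUInt32();
uint compressedSize = reader.ReadUInt32();
uint uncompressedSize = reader.ReadUInt32();
ushort fileNameLength = reader.ReadUInt16();
ushort extraFieldLength = reader.ReadUInt16();
string fileName = System.Text.Encoding.UTF8.GetString(reader.ReadBytes(fileNameLength));
reader.ReadBytes(extraFieldLength);             // skip the extra field
// `compressedSize` bytes of compressed file data follow.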

jnm2 commented 2 years ago

To get myself unblocked, I created a proof of concept which successfully reads a streaming .zip file: https://gist.github.com/jnm2/31bdf08357a44c91d01736ad43b9c447

await using var reader = new StreamingZipReader(downloadStream);

while (await reader.MoveToNextEntryAsync(skipDirectories: true, CancellationToken.None))
{
    Console.WriteLine($"{reader.CurrentEntry.Name}: {reader.CurrentEntry.Length} bytes");

    using var stream = reader.GetCurrentEntryStream();
    using var testReader = new StreamReader(stream);
    var test = await testReader.ReadToEndAsync();
    // (my test download had only text files, and they all looked right!)
}

bjornharrtell commented 2 months ago

FWIW, with permission from @jnm2, I've published StreamingZipReader as a NuGet package (https://www.nuget.org/packages/StreamingZipReader) and recently fixed a bug related to ZIP64 support.