Open ayende opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-io-compression See info in area-owners.md if you want to be subscribed.
Author: | ayende |
---|---|
Assignees: | - |
Labels: | `area-System.IO.Compression`, `tenet-performance`, `untriaged` |
Milestone: | - |
@carlossanlop this is something that we should consider for the .NET 7 Compression work as it's related to supporting large files (> 2GB)
This is particularly bad when the stream being passed to the ZipArchive constructor is a 10-15 minute download and you want to be processing the contents as they become available.
Please consider ZipArchive.OpenAsync
or zipArchive.GetEntriesAsync
to avoid blocking on I/O, too.
The first thing ZipArchive currently does with a seekable stream is Seek(-18, SeekOrigin.End)
. I'm hoping that can be avoided for non-seekable streams. I'm rather worried that there's no help for it given that .zip files place the directory at the end of the stream. https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure mentions that scanning for file entry headers isn't necessarily valid because a file may have been deleted from the directory. However, I would guess the vast majority of .zip files are built once and never deleted from, and thus would be packed with no unused space around the file entries.
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT mentions streaming .zip files and says that every local file header must have an entry in the central directory. So that takes care of the deleted files concern and shows that streaming is a known use case.
4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.
4.3.5 File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
To get myself unblocked, I created a proof of concept which successfully reads a streaming .zip file: https://gist.github.com/jnm2/31bdf08357a44c91d01736ad43b9c447
await using var reader = new StreamingZipReader(downloadStream);
while (await reader.MoveToNextEntryAsync(skipDirectories: true, CancellationToken.None))
{
Console.WriteLine($"{reader.CurrentEntry.Name}: {reader.CurrentEntry.Length} bytes");
using var stream = reader.GetCurrentEntryStream();
using var testReader = new StreamReader(stream);
var test = await testReader.ReadToEndAsync();
// (my test download had only text files, and they all looked right!)
}
FWIW with permission from @jnm2 I've published StreamingZipReader as nuget https://www.nuget.org/packages/StreamingZipReader and recently fixed a bug with regard to ZIP64 support.
Description
Analysis
Reason for this is here: https://github.com/dotnet/runtime/blob/00c38c7da82cff64932396897e057cc20bdbe743/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L144
When we pass a non seekable stream to
ZipArchive
- it will read it into aMemoryStream
. That will only work if the data can fit inMemoryStream
. I assume that this is because the Zip format requires seeking (directory, etc).However, this is very surprising and can cause both performance issues due to loading all contents to memory and unexpected failures if most of the data is < 2GB.
I'm not sure if there is a way to fix this, given backward compact issues.