dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Allow opening (raw) compressed archive entries in ZipArchiveEntry #63155

Open PJB3005 opened 2 years ago

PJB3005 commented 2 years ago

Background and motivation

Right now, ZipArchive only supports opening entries compressed with Stored, Deflate and Deflate64. While there are open issues about adding support for more of the specified methods, such as LZMA, I would like to propose an orthogonal solution to this problem.

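Allow access to the raw compressed streams in the zip file, and the compression method flag in the entry. This opens up a few possibilities:

* Allows developers to use third-party compression libraries to get support for algorithms like zstd or LZMA themselves.
* Can be used in advanced scenarios when, for example, copying between zip files, to avoid having to decompress and re-compress data.
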
I am far from an expert on the zip file format, but from my rudimentary understanding of it, this should be possible?
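
As a rough illustration of why this looks feasible: each entry's local file header (APPNOTE.TXT section 4.3.7) stores the compression method as a 2-byte little-endian field, and the compressed bytes follow the header directly. The sketch below is only an assumption-laden illustration, not how ZipArchive is implemented. `RawZipSketch` and `ReadFirstEntry` are hypothetical names; the code reads only the first entry and ignores the central directory, data descriptors, ZIP64, and encryption.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of System.IO.Compression.
public static class RawZipSketch
{
    // Reads the first local file header of a zip stream and returns the
    // compression method ID, the entry name, and the still-compressed payload.
    public static (ushort Method, string Name, byte[] RawData) ReadFirstEntry(Stream zip)
    {
        using var reader = new BinaryReader(zip, Encoding.UTF8, leaveOpen: true);

        if (reader.ReadUInt32() != 0x04034b50) // local file header signature "PK\x03\x04"
            throw new InvalidDataException("Not a local file header.");

        reader.ReadUInt16();                   // version needed to extract
        ushort flags = reader.ReadUInt16();    // general purpose bit flag
        ushort method = reader.ReadUInt16();   // compression method (section 4.4.5)
        reader.ReadUInt32();                   // last mod time + last mod date
        reader.ReadUInt32();                   // CRC-32
        uint compressedSize = reader.ReadUInt32();
        reader.ReadUInt32();                   // uncompressed size
        ushort nameLength = reader.ReadUInt16();
        ushort extraLength = reader.ReadUInt16();

        // Bit 3 means the sizes are stored after the data instead; a real
        // reader would take them from the central directory.
        if ((flags & 0x0008) != 0)
            throw new NotSupportedException("Entry uses a data descriptor.");

        // Assumes an ASCII/UTF-8 name, which holds for simple archives.
        string name = Encoding.UTF8.GetString(reader.ReadBytes(nameLength));
        reader.ReadBytes(extraLength);         // skip the extra field
        byte[] raw = reader.ReadBytes((int)compressedSize);
        return (method, name, raw);
    }
}
```

With the method ID and raw bytes in hand, a caller could pass `RawData` to `DeflateStream` when `Method` is 8, or to a third-party decoder for other IDs.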

API Proposal

namespace System.IO.Compression
{
    public class ZipArchiveEntry
    {
        public ZipCompressionMethod CompressionMethod { get; }
        public Stream OpenRaw();
    }

    public class ZipArchive
    {
        public ZipArchiveEntry CreateEntry(string entryName, ZipCompressionMethod compression);
    }

    public enum ZipCompressionMethod : short
    {
        // Corresponds to the compression method described by APPNOTE.TXT section 4.4.5
        Stored = 0,
        Deflate = 8,
        Bzip2 = 12,
        Lzma = 14,
        Zstd = 93
    }
}

API Usage

Using third-party decompression streams with ZipArchive:

var zipArchive = new ZipArchive(..., ZipArchiveMode.Read);
var entry = zipArchive.GetEntry("foo.json");
Debug.Assert(entry.CompressionMethod == ZipCompressionMethod.Zstd);

// Imagine a ZstdStream from a third-party library.
var stream = new ZstdStream(entry.OpenRaw(), CompressionMode.Decompress);

Copying compressed blobs between zip files:

ZipArchive a = ...;
ZipArchive b = ...;

var aEntry = a.GetEntry("foo.json");
var bEntry = b.CreateEntry("foo.json", aEntry.CompressionMethod);

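// Dispose both raw streams once the copy is done.
using var source = aEntry.OpenRaw();
using var destination = bEntry.OpenRaw();
source.CopyTo(destination);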
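
For comparison, here is a minimal sketch of what the same copy requires with the APIs that exist today: the entry has to be inflated through `Open()` and deflated again on its way into the destination archive. `ZipCopySketch` and `CopyEntryToday` are hypothetical names; the calls inside are existing `System.IO.Compression` APIs. The sketch assumes `source` was opened in Read or Update mode and `destination` in Create or Update mode.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper built only on existing APIs.
public static class ZipCopySketch
{
    // Copies one entry between archives. The data is decompressed from
    // `source` and recompressed into `destination`, which is the round
    // trip that the proposed OpenRaw() would avoid.
    public static void CopyEntryToday(ZipArchive source, ZipArchive destination, string entryName)
    {
        ZipArchiveEntry from = source.GetEntry(entryName)
            ?? throw new FileNotFoundException($"No entry named '{entryName}'.");
        ZipArchiveEntry to = destination.CreateEntry(entryName, CompressionLevel.Optimal);

        using Stream input = from.Open();  // decompressing stream
        using Stream output = to.Open();   // compressing stream
        input.CopyTo(output);
    }
}
```

The cost of this version is CPU time spent on a decompress and recompress cycle whose output is, at best, equivalent to the bytes already sitting in the source archive.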

Alternative Designs

No response

Risks

No response

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io-compression. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: PJB3005
Assignees: -
Labels: `api-suggestion`, `area-System.IO.Compression`, `untriaged`
Milestone: -

AlgorithmsAreCool commented 2 years ago

I have a real-world use case for this also. I recently implemented my own incomplete parser for ZIP archives to use LibDeflate as the decompressor, which got me some nice speedups. It would be nice to be able to use the structure parsing with my own compression libs.

PJB3005 commented 2 years ago

My use cases are that I want to be able to use zip files (because it's a standard format) but with LZMA (significant space savings for my use case), while also being able to dump these blobs straight into an SQLite DB (while still compressed). Another use case is that I want to basically use zip files as object storage behind an API, and being able to throw the compressed blobs over the wire directly would be great.

This would hit multiple birds with one stone.

Clockwork-Muse commented 2 years ago

> Allows developers to use third-party compression libraries to get support for algorithms like zstd or LZMA themselves.

Having an enum that requires a third-party library to supply that compression algorithm is likely to cause confusion.

At least some compression libraries add a header to the compressed stream - that being the case, if the constructor instead took something like

public interface IZipCompressionStream {
    // Interfaces cannot declare fields, so these are read-only properties.
    public string CompressionMethod { get; }
    public ReadOnlySpan<byte> Header { get; }
    public Stream Compress(Stream raw);
    public bool TryDecompress(Stream compressed, out Stream raw);
    public Stream Decompress(Stream compressed);
}

... this would allow for arbitrary compression methods, including ones not currently envisioned.

PJB3005 commented 2 years ago

> Having an enum that requires a third-party library to supply that compression algorithm is likely to cause confusion.

It is a lower level API that simply exposes more information about the underlying zip file format. Python also exposes the ZipInfo.compress_type field in its zipfile module (but no ability to access the raw stream, AFAICT).

Limiting the enum members to the compression methods supported by .NET today would be an option, which I suppose is closer to what Python does in this regard.

> At least some compression libraries add a header to the compressed stream - that being the case, if the constructor instead took something like

Relying on such headers is silly for zip files, since they already have a standardized 2-byte entry field for compression method.
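
To illustrate the point, here is a minimal sketch of selecting a decoder purely from the numeric method ID in the entry header, without inspecting any library-specific stream header. `ZipMethodDispatchSketch`, `WrapDecoder`, and the `thirdParty` callback are hypothetical names used only for illustration; the IDs are the ones listed in the proposal above.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of System.IO.Compression.
public static class ZipMethodDispatchSketch
{
    // Wraps a raw entry stream in a decoder chosen by the method ID
    // (APPNOTE.TXT section 4.4.5). Unknown IDs go to the caller's callback.
    public static Stream WrapDecoder(ushort method, Stream raw,
        Func<ushort, Stream, Stream?>? thirdParty = null) => method switch
    {
        0 => raw,                                                // Stored
        8 => new DeflateStream(raw, CompressionMode.Decompress), // Deflate
        // Anything else (12 = Bzip2, 14 = LZMA, 93 = Zstd, ...) is delegated.
        _ => thirdParty?.Invoke(method, raw)
             ?? throw new NotSupportedException($"No decoder for method {method}."),
    };
}
```

A caller with a third-party zstd binding could pass a callback that returns its own decompression stream when `method` is 93 and `null` otherwise.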

This entire IZipCompressionStream interface seems like a very complex solution, and it does not address the other point (access to raw blobs, although you could probably abuse it to achieve that by jumping through many silly hoops).

svick commented 2 years ago

@Clockwork-Muse I think the API should follow the standard (though which of the specified compression methods should be named members of the enum is up for debate), instead of inventing its own way of specifying the compression method, that may or may not be useful in the future. Or do you have an example where what you're proposing would be useful today?

Clockwork-Muse commented 2 years ago

> Relying on such headers is silly for zip files, since they already have a standardized 2-byte entry field for compression method.

Ah, I was not aware that zip itself listed the possible methods, my bad.

adamsitnik commented 2 years ago

@carlossanlop what is your take on this? Would adding such an API help to implement algorithms that are currently not supported OOTB?

jeffhandley commented 1 year ago

Thanks for this suggestion, @PJB3005. I'm moving this to Future, but I've also referenced it in #62658 so that we look at it alongside the LZMA and other potential investments during our .NET 8 planning.