dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Add support for Zstandard to System.IO.Compression #59591

Open · carlossanlop opened this issue 3 years ago

carlossanlop commented 3 years ago

Zstandard (or Zstd) is a fast compression algorithm that was published by Facebook in 2015, and had its first stable release in May 2021.

Their official repo offers a C implementation. https://github.com/facebook/zstd

Data compression mechanism specification: https://datatracker.ietf.org/doc/html/rfc8478

Features:

- It is faster than Deflate, especially in decompression, while offering a similar compression ratio.
- Its maximum compression level is similar to that of lzma, and it performs better than lza and bzip2.
- It reached the Pareto frontier (https://en.wikipedia.org/wiki/Pareto_efficiency), as it decompresses faster than any other currently available algorithm with a similar or worse compression ratio.
- It supports multi-threading.
- It can be saved to a *.zst file.
- It has a dual BSD+GPLv2 license. We would be using the BSD license.

It's used by:

- The Linux kernel, as a compression option for btrfs and SquashFS since 2017.
- FreeBSD, for coredumps.
- AWS Redshift, for databases.
- Canonical, Fedora, and Arch Linux, for their package managers.
- The Nintendo Switch, to compress its files.

We could offer a stream-based class, like we do for Deflate with DeflateStream or GZipStream, but we should also consider offering a stream-less static class, since that is a common request.
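Since a stream-less API is a common request, here is a rough illustrative sketch of what such a static surface could look like, loosely modeled on the existing BrotliEncoder one-shot methods. All names and signatures below are hypothetical, not a formal proposal:

namespace System.IO.Compression
{
    // Hypothetical stream-less surface; illustrative only.
    public static class ZStandard
    {
        // One-shot compression into a caller-supplied buffer; returns false if the
        // destination buffer is too small.
        public static bool TryCompress(ReadOnlySpan<byte> source, Span<byte> destination,
                                       out int bytesWritten, int compressionLevel = 3);

        // One-shot decompression; returns false if the destination buffer is too small
        // or the input is not a valid zstd frame.
        public static bool TryDecompress(ReadOnlySpan<byte> source, Span<byte> destination,
                                         out int bytesWritten);
    }
}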

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/area-system-io-compression. See info in area-owners.md if you want to be subscribed.

manandre commented 3 years ago

It would be a great enhancement for .NET, and also great for the public visibility of this impressive compression algorithm. If you accept it, I can contribute to making it happen. I already foresee multiple steps:

Open questions:

carlossanlop commented 3 years ago

Thank you, @manandre, for your offer!

Let's start by discussing the stream API.

I think it makes sense for the stream class to look very similar to Deflate, since both would only wrap a compression algorithm (unlike the Zip, GZip, ZLib APIs, which additionally represent a compression/archiving format).

I am thinking we can avoid creating too many constructors by introducing a separate ZStandardOptions class to specify the configuration values.

The ZStandardOptions class will allow specifying the compression level as an integer (and will throw if an out-of-bounds value is specified). This helps avoid the typical CompressionLevel limitation of only four values. But if the user wants to use CompressionLevel anyway, we can provide a constructor that takes a CompressionLevel and converts it to a predefined value from the compression level range allowed by ZStandard, which goes from 1 to 22, with 3 being the default. The user should also be able to specify negative levels, according to the manual:

The library supports regular compression levels from 1 up to ZSTD_maxCLevel(), which is currently 22. Levels >= 20, labeled --ultra, should be used with caution, as they require more memory. The library also offers negative compression levels, which extend the range of speed vs. ratio preferences. The lower the level, the faster the speed (at the cost of compression).
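As a rough sketch of that mapping (the helper name is hypothetical; the values follow the doc comment in the proposal below, where ZSTD_CLEVEL_DEFAULT is 3 and ZSTD_maxCLevel() is currently 22):

// Illustrative only; not a proposed API. Maps the existing CompressionLevel
// enum onto zstd's integer levels as described above.
private static int ToZStandardLevel(CompressionLevel level) => level switch
{
    CompressionLevel.NoCompression => 1,  // official normal minimum
    CompressionLevel.Fastest       => 1,  // official normal minimum
    CompressionLevel.Optimal       => 3,  // ZSTD_CLEVEL_DEFAULT
    CompressionLevel.SmallestSize  => 22, // ZSTD_maxCLevel()
    _ => throw new ArgumentOutOfRangeException(nameof(level)),
};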

Questions

namespace System.IO.Compression
{
    public class ZStandardOptions
    {
        /// <summary>Allow mapping the CompressionLevel enum to predefined levels for ZStandard:
        /// - CompressionLevel.NoCompression = 1, // Official normal minimum
        /// - CompressionLevel.Fastest = 1,       // Official normal minimum
        /// - CompressionLevel.Optimal = 3,       // Official default: ZSTD_CLEVEL_DEFAULT
        /// - CompressionLevel.SmallestSize = 22  // Official maximum: ZSTD_MAX_CLEVEL
        /// </summary>
        public ZStandardOptions(CompressionLevel level);
        // Min = ZSTD_minCLevel() which can be negative, Max=ZSTD_maxCLevel()=22, Default=ZSTD_CLEVEL_DEFAULT=3, throw if out-of-bounds
        public int CompressionLevel { get; set; }
        public CompressionMode Mode { get; set; }
        public bool LeaveOpen { get; set; }
        public static int MaxCompressionLevel { get; } // P/Invoke for current maximum: 22
    }

    public class ZStandardStream : Stream
    {
        public ZStandardStream(Stream stream, ZStandardOptions? options); // If options null, then use default values
        public Stream BaseStream { get; }
        public override bool CanRead { get; }
        public override bool CanSeek { get; }
        public override bool CanWrite { get; }
        public override long Length { get; }
        public override long Position { get; set; }
        public override IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback? asyncCallback, object? asyncState);
        public override IAsyncResult BeginWrite(byte[] buffer, int offset, int count, AsyncCallback? asyncCallback, object? asyncState);
        public override void CopyTo(Stream destination, int bufferSize);
        public override Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken);
        protected override void Dispose(bool disposing);
        public override ValueTask DisposeAsync();
        public override int EndRead(IAsyncResult asyncResult);
        public override void EndWrite(IAsyncResult asyncResult);
        public override void Flush();
        public override Task FlushAsync(CancellationToken cancellationToken);
        public override int Read(byte[] buffer, int offset, int count);
        public override int Read(Span<byte> buffer);
        public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken);
        public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
        public override int ReadByte();
        public override long Seek(long offset, SeekOrigin origin);
        public override void SetLength(long value);
        public override void Write(byte[] buffer, int offset, int count);
        public override void Write(ReadOnlySpan<byte> buffer);
        public override void WriteByte(byte value); // ZLibStream overrides it, but not Deflate/GZipStream
        public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken);
        public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
    }
}
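For illustration, here is how the proposed types might be used once they exist (a sketch only; ZStandardStream and ZStandardOptions are not yet part of the runtime):

using System.IO;
using System.IO.Compression;

// Hypothetical usage of the proposal above; these types do not exist yet.
static void CompressFile(string inputPath, string outputPath)
{
    var options = new ZStandardOptions(CompressionLevel.Optimal) { Mode = CompressionMode.Compress };
    using FileStream input = File.OpenRead(inputPath);
    using FileStream output = File.Create(outputPath); // conventionally a *.zst file
    using var zstd = new ZStandardStream(output, options);
    input.CopyTo(zstd); // bytes are compressed as they are written to the wrapped stream
}

static void DecompressFile(string inputPath, string outputPath)
{
    var options = new ZStandardOptions(CompressionLevel.Optimal) { Mode = CompressionMode.Decompress };
    using FileStream input = File.OpenRead(inputPath);
    using FileStream output = File.Create(outputPath);
    using var zstd = new ZStandardStream(input, options);
    zstd.CopyTo(output); // bytes are decompressed as they are read from the wrapped stream
}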
manandre commented 3 years ago
agocke commented 3 years ago

FYI @VSadov, this may be particularly interesting for single-file compression, as it is supposed to be very fast at decompression.

This might mean we need deeper runtime integration for it to be usable during bundler loading.

GSPP commented 3 years ago

How does the multi-threading work internally? Does it integrate somehow with the usual .NET infrastructure (TaskScheduler and such)? Or does the library start native threads?

I wonder about that because sometimes you need threading to play nicely with whatever else lives in the same process. In a web app, multi-threading could cause load spikes that crowd out request work from the CPU. Reducing the degree of parallelism (DOP) is only a partial fix, because multiple parallel compression jobs would again saturate all cores and cause the problem to reappear. Isolating such work onto a custom thread pool can be a solution, but that would not work if the library starts its own threads.
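(As a sketch of that last point: managed-side compression work could be capped with a scheduler like the one below, but this only helps when the parallelism is driven from managed code, not when the native library spawns its own threads. The CompressAsync wrapper is hypothetical.)

using System.Threading;
using System.Threading.Tasks;

// Sketch: cap compression work at two concurrent tasks so it cannot saturate
// all cores. Only effective for parallelism driven from managed code.
var pool = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrencyLevel: 2);

Task CompressAsync(byte[] data) => Task.Factory.StartNew(
    () => { /* call the (hypothetical) zstd compression API here */ },
    CancellationToken.None,
    TaskCreationOptions.DenyChildAttach,
    pool.ConcurrentScheduler);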

Another concern would be startup overhead for multi-threading inside the library. Is there thread pooling?


It seems to me that CompressionMode should be a mandatory constructor argument. There is no sensible default, and without that argument the meaning of the code is unclear.

bool LeaveOpen is about the stream, not about compression. In my opinion, it does not belong in the options class; it should be a constructor argument specific to the stream. This option would, for example, not apply to a static helper method such as static byte[] Compress(byte[] data, ZStandardOptions? options). The options object would then carry around ignored options.
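To make that concrete, the constructors could mirror the existing DeflateStream/BrotliStream pattern, for example (shape only, not a formal proposal):

public class ZStandardStream : Stream
{
    // Mode is required; leaveOpen stays on the stream rather than in the options.
    public ZStandardStream(Stream stream, CompressionMode mode, bool leaveOpen = false);
    public ZStandardStream(Stream stream, ZStandardOptions options, bool leaveOpen = false); // compression settings only
}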

manandre commented 3 years ago

About thread pooling, the zstd.h header file contains:

/* ! Thread pool :
 * These prototypes make it possible to share a thread pool among multiple compression contexts.
 * This can limit resources for applications with multiple threads where each one uses
 * a threaded compression mode (via ZSTD_c_nbWorkers parameter).
 * ZSTD_createThreadPool creates a new thread pool with a given number of threads.
 * Note that the lifetime of such pool must exist while being used.
 * ZSTD_CCtx_refThreadPool assigns a thread pool to a context (use NULL argument value
 * to use an internal thread pool).
 * ZSTD_freeThreadPool frees a thread pool, accepts NULL pointer.
 */
typedef struct POOL_ctx_s ZSTD_threadPool;
ZSTDLIB_API ZSTD_threadPool* ZSTD_createThreadPool(size_t numThreads);
ZSTDLIB_API void ZSTD_freeThreadPool (ZSTD_threadPool* pool);  /* accept NULL pointer */
ZSTDLIB_API size_t ZSTD_CCtx_refThreadPool(ZSTD_CCtx* cctx, ZSTD_threadPool* pool);
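A minimal interop sketch for those entry points might look like the following (the native library name "libzstd" and the wrapper class name are assumptions):

using System;
using System.Runtime.InteropServices;

// Minimal interop sketch for the thread-pool entry points quoted above.
internal static class ZStdNative
{
    [DllImport("libzstd")]
    internal static extern IntPtr ZSTD_createThreadPool(nuint numThreads);

    [DllImport("libzstd")]
    internal static extern void ZSTD_freeThreadPool(IntPtr pool); // accepts NULL

    [DllImport("libzstd")]
    internal static extern nuint ZSTD_CCtx_refThreadPool(IntPtr cctx, IntPtr pool);
}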
VSadov commented 3 years ago

Zstandard would be very useful for single-file compression. We currently use ZLib/Deflate because it is available in the runtime, but we would prefer something faster, as the impact of decompression is very noticeable at startup.

We examined LZ4 and Zstd as alternatives; LZ4 is faster at decompression, but Zstd would let us keep the same compression ratio as Deflate.

If there is Zstd support in the runtime, single-file compression will definitely switch to it.

GSPP commented 3 years ago

Here are some interesting benchmarks: https://github.com/google/brotli/issues/553. ZStandard offers a really nice trade-off for speed and compression ratio.

(benchmark chart from the linked brotli issue comparing speed and compression ratio)

iamcarbon commented 1 year ago

It looks like Chrome may also be getting support for decoding zstd-encoded content, which makes this relevant to web and cloud scenarios as well.

https://chromestatus.com/feature/6186023867908096

Putting in my vote of support, and hoping to see this prioritized in the .NET 9.0 planning.

UPDATE: Chrome has confirmed that they are shipping zstd support in v123.

manandre commented 1 year ago

I have opened https://github.com/dotnet/aspnetcore/issues/50643 to support the zstd Content-Encoding in ASP.NET Core. It is currently considered blocked by ZStandard compression support in the .NET runtime. @carlossanlop, can we make it happen in .NET 9? I am still ready to help on this topic.
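Once the runtime type exists, the ASP.NET Core side could plug in through the existing response-compression extension point, roughly like this (a sketch; ZStandardStream and ZStandardOptions are the hypothetical types from this proposal):

using System.IO;
using System.IO.Compression;
using Microsoft.AspNetCore.ResponseCompression;

// Sketch of a zstd response-compression provider built on the proposed stream type.
public sealed class ZStandardCompressionProvider : ICompressionProvider
{
    public string EncodingName => "zstd"; // Content-Encoding token
    public bool SupportsFlush => true;

    public Stream CreateStream(Stream outputStream)
    {
        var options = new ZStandardOptions(CompressionLevel.Fastest) { Mode = CompressionMode.Compress };
        return new ZStandardStream(outputStream, options);
    }
}

// Registration: services.AddResponseCompression(o => o.Providers.Add<ZStandardCompressionProvider>());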

alexandrehtrb commented 1 year ago

+1

dev-tony-hu commented 9 months ago

Is there any plan to support it in .NET 9.0?

YohanSciubukgian commented 8 months ago

The Chrome 123 release supports zstd.

Could you consider it for .NET 9?

QuinnDamerell commented 5 months ago

It's super cool to see that Chrome shipped this. I think the biggest motivating factor for getting this work done is letting ASP.NET support zstd as an out-of-the-box encoding option.

It looks like Facebook.com is already serving web pages with zstd compression; adding it to the .NET web stack would be amazing!

Most implementations bind to the native Facebook libraries, but there are a few existing C# projects that are ports, like https://github.com/oleg-st/ZstdSharp.
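For anyone who needs zstd in .NET today, that managed port can be used roughly as follows (API names taken from the project's README; verify against the current package):

using System.IO;
using ZstdSharp;

// Round trip through the ZstdSharp managed port (not a BCL API).
byte[] sourceBytes = File.ReadAllBytes("data.bin");

using var compressor = new Compressor(3);
byte[] compressed = compressor.Wrap(sourceBytes).ToArray();

using var decompressor = new Decompressor();
byte[] roundTripped = decompressor.Unwrap(compressed).ToArray();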

rgueldenpfennig commented 4 months ago

The Chrome 123 release supports zstd:

* https://developer.chrome.com/blog/new-in-chrome-123#more

* https://github.com/facebook/zstd/releases/tag/v1.5.6

Could you consider it for .NET 9?

Since the 126 release, Mozilla Firefox also supports zstd compression: https://www.mozilla.org/en-US/firefox/126.0/releasenotes/

Mrgaton commented 3 months ago

Having this in .NET 9 would be awesome; it would also be great to see other algorithms like LZMA2.

siyavash1984 commented 1 month ago

I noticed that this issue has been open for a few years now, and I was wondering if there are any plans to add Zstandard (Zstd) support to .NET. If not, I’d be happy to contribute to help implement this feature.

Given the performance benefits and the wide adoption of Zstd, I think it would be a great addition to the framework. If there are any steps or guidelines you can share, I’d love to assist in moving this forward.

Looking forward to your feedback and guidance!

Thanks!

Mrgaton commented 1 month ago

Yes, please: zstd, LZMA2, and 7z in .NET 9.

carlossanlop commented 1 month ago

@siyavash1984 thank you! We still need to propose the APIs first. Here's the process: https://github.com/dotnet/runtime/blob/43813ac73242fa78c463d456bf755e3a6622b5d7/docs/project/api-review-process.md

At the moment we have this initial proposal https://github.com/dotnet/runtime/issues/59591#issuecomment-933059993 and one reply discussing it. Additional feedback and discussion are welcome on these APIs (or on additional proposed ones) to keep this moving.

EamonNerbonne commented 4 days ago

In terms of API proposal: