dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.47k stars 4.76k forks source link

System.IO.Compression.Crc32Helper should use Crc32 intrinsics #40244

Open benaadams opened 4 years ago

benaadams commented 4 years ago

Rather than interoping to zlib when the Crc32 intrinsics are available on Arm and x86/x64

https://github.com/dotnet/runtime/blob/995224db011f77eb095279122244704ccca01d5f/src/libraries/System.IO.Compression/src/System/IO/Compression/Crc32Helper.ZLib.cs#L11-L26

/cc @tannergooding

danmoseley commented 4 years ago

According to https://github.com/dotnet/runtime/issues/2036, the intrinsics use different polynomials. If I understand right, zlib uses 0x04C11DB7/0xEDB88320 and SSE uses 0x1EDC6F41. Wikipedia has a large table.

Is that right?

benaadams commented 4 years ago

Sigh might be; I think you might have to use PCLMULQDQ instead

JimBobSquarePants commented 4 years ago

Tada! Take what you want and feed back any improvements you can make.

https://github.com/SixLabors/ImageSharp/blob/f4f689ce67ecbcc35cebddba5aacb603e6d1068a/src/ImageSharp/Formats/Png/Zlib/Crc32.cs

brantburnett commented 1 year ago

I believe that this can be done now using the vectorized implementation now found in System.IO.Hashing (it uses PCLMULQDQ on Intel and equivalents on ARM).

My concern is that System.IO.Compression appears to be an always included part of the framework, while System.IO.Hashing is distributed on NuGet. We'd need to make System.IO.Hashing always available. I'm not sure what the procedures would be for getting such a change approved, or if it's worth it.

stephentoub commented 1 year ago

Yeah, I don't think we're going to want to pull NonCryptographicHashAlgorithm down into corelib. We could consider alternate ways of sharing the code / implementation with System.IO.Compression, though. Note that the checksum is only used there as part of ZipArchive when writing out an entry. It is, however, a meaningful chunk of the time required to do so, something around 30% for typical use if memory serves.

brantburnett commented 1 year ago

I did some testing to see if this improvement might be worthwhile. Here are the results I'm seeing on my Intel Windows machine. The test was compressing a single file to a ZipArchive using a similar approach to the other compression tests in the microbenchmarks. I may be able to squeeze out a bit more by using the static Update method if it's accessible by internalizing Crc32.

The biggest risk would be old Intel CPUs that don't support the intrinsics required for vectorization. I would suspect that the zlib scalar implementation may be better for those cases. The upside is we avoid a GC pin and transition and gain vectorization when available.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1635) Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores .NET SDK=8.0.100-preview.3.23178.7 [Host] : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2 Job-DUHTZA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2 Job-MWSTQA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method Job file Mean Error StdDev Median Min Max Ratio RatioSD
Compress Interop TestDocument.pdf 3,102.3 us 29.86 us 27.93 us 3,106.4 us 3,060.1 us 3,170.0 us 1.00 0.00
Compress C# TestDocument.pdf 3,046.6 us 21.40 us 20.02 us 3,045.1 us 3,016.4 us 3,080.2 us 0.98 0.01
Compress Interop alice29.txt 3,978.3 us 24.40 us 22.82 us 3,971.5 us 3,950.2 us 4,026.7 us 1.00 0.00
Compress C# alice29.txt 3,953.8 us 33.50 us 27.98 us 3,943.8 us 3,909.5 us 4,015.5 us 0.99 0.01
Compress Interop sum 662.0 us 11.95 us 11.18 us 658.1 us 650.2 us 684.0 us 1.00 0.00
Compress C# sum 639.7 us 2.26 us 1.77 us 640.3 us 637.1 us 642.2 us 0.96 0.02
stephentoub commented 1 year ago

Thanks, @brantburnett. Just a few percent in throughput isn't worth bringing this functionality down from System.IO.Hashing nor including a large amount of code in System.IO.Compression. Based on these numbers, I'd speculate that the zlib being used already does some amount of vectorization in its crc32 calculation, e.g. the zlib from Intel that .NET uses on Windows does: https://github.com/dotnet/runtime/blob/main/src/native/external/zlib-intel/crc_folding.c