benaadams opened 4 years ago
According to https://github.com/dotnet/runtime/issues/2036, the intrinsics use a different polynomial than zlib. If I understand right, zlib uses 0x04C11DB7 (0xEDB88320 in bit-reflected form) and the SSE4.2 CRC32 instruction uses 0x1EDC6F41 (CRC-32C). Wikipedia has a large table of CRC polynomials.
Is that right?
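For anyone skimming: the two polynomials really do produce different checksums. A minimal table-less, bit-reflected sketch (not the runtime's implementation) over the standard `"123456789"` check input; note 0x1EDC6F41 is 0x82F63B78 in reflected form:

```csharp
using System;
using System.Text;

class CrcPolyDemo
{
    // Bit-at-a-time CRC over an arbitrary reflected polynomial.
    internal static uint Crc(byte[] data, uint reflectedPoly)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (((crc & 1) != 0) ? reflectedPoly : 0);
        }
        return ~crc;
    }

    static void Main()
    {
        byte[] check = Encoding.ASCII.GetBytes("123456789");
        // zlib's CRC-32: polynomial 0x04C11DB7, reflected 0xEDB88320.
        Console.WriteLine($"CRC-32  : 0x{Crc(check, 0xEDB88320):X8}"); // 0xCBF43926
        // SSE4.2's CRC32 instruction: CRC-32C, polynomial 0x1EDC6F41, reflected 0x82F63B78.
        Console.WriteLine($"CRC-32C : 0x{Crc(check, 0x82F63B78):X8}"); // 0xE3069283
    }
}
```

The 0xCBF43926 / 0xE3069283 values are the well-known check values for those two CRC variants, which is why the SSE4.2 instruction can't be used as-is for the zip format.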
Sigh, it might be; I think you might have to use PCLMULQDQ instead.
Tada! Take what you want and feed back any improvements you can make.
I believe this can be done now using the vectorized implementation found in System.IO.Hashing (it uses PCLMULQDQ on Intel and equivalents on ARM).
My concern is that System.IO.Compression appears to be an always included part of the framework, while System.IO.Hashing is distributed on NuGet. We'd need to make System.IO.Hashing always available. I'm not sure what the procedures would be for getting such a change approved, or if it's worth it.
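For reference, the shape of that API (from the System.IO.Hashing NuGet package) is roughly as below; the `GetCurrentHashAsUInt32` helper is from the 8.0-era package, so this is a sketch of usage rather than a claim about every version:

```csharp
using System;
using System.IO.Hashing; // NuGet: System.IO.Hashing

class HashingUsage
{
    static void Main()
    {
        byte[] data = { 1, 2, 3, 4 };

        // One-shot: the CRC-32 (zlib polynomial) as little-endian bytes.
        byte[] oneShot = Crc32.Hash(data);

        // Streaming, the way a zip writer would update the checksum
        // while copying an entry's bytes through in chunks.
        var crc = new Crc32();
        crc.Append(data.AsSpan(0, 2));
        crc.Append(data.AsSpan(2));
        uint running = crc.GetCurrentHashAsUInt32();

        Console.WriteLine(running == BitConverter.ToUInt32(oneShot, 0)); // same result
    }
}
```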
Yeah, I don't think we're going to want to pull NonCryptographicHashAlgorithm down into corelib. We could consider alternate ways of sharing the code / implementation with System.IO.Compression, though. Note that the checksum is only used there as part of ZipArchive when writing out an entry. It is, however, a meaningful chunk of the time required to do so, something around 30% for typical use if memory serves.
I did some testing to see if this improvement might be worthwhile. Here are the results I'm seeing on my Intel Windows machine. The test was compressing a single file to a ZipArchive using a similar approach to the other compression tests in the microbenchmarks. I may be able to squeeze out a bit more by using the static Update method, if it's made accessible by internalizing Crc32.
The biggest risk would be old Intel CPUs that don't support the intrinsics required for vectorization; I suspect the zlib scalar implementation may be better in those cases. The upside is that we avoid a GC pin and an interop transition, and gain vectorization when it's available.
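A rough sketch of the feature checks involved (illustrative, not the actual fallback logic): on x86 the CRC32 instruction is hard-wired to the CRC-32C polynomial, so a vectorized zlib CRC-32 depends on carry-less multiply folding, while Arm's CRC32 extension covers both polynomials directly.

```csharp
using System;
using X86 = System.Runtime.Intrinsics.X86;
using Arm = System.Runtime.Intrinsics.Arm;

class CrcCapabilities
{
    static void Main()
    {
        // x86/x64: Sse42.Crc32 computes CRC-32C (0x1EDC6F41) only, so the
        // zlib polynomial relies on PCLMULQDQ carry-less multiply folding.
        bool x86Folding = X86.Pclmulqdq.IsSupported;

        // Arm: the optional CRC32 extension has instructions for both the
        // CRC-32 and CRC-32C polynomials.
        bool armCrc = Arm.Crc32.IsSupported;

        Console.WriteLine(x86Folding || armCrc
            ? "accelerated CRC-32 available"
            : "scalar fallback (older CPUs without PCLMULQDQ/CRC32)");
    }
}
```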
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1635)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.3.23178.7
  [Host]     : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
  Job-DUHTZA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-MWSTQA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
Method | Job | file | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
---|---|---|---|---|---|---|---|---|---|---|
Compress | Interop | TestDocument.pdf | 3,102.3 us | 29.86 us | 27.93 us | 3,106.4 us | 3,060.1 us | 3,170.0 us | 1.00 | 0.00 |
Compress | C# | TestDocument.pdf | 3,046.6 us | 21.40 us | 20.02 us | 3,045.1 us | 3,016.4 us | 3,080.2 us | 0.98 | 0.01 |
Compress | Interop | alice29.txt | 3,978.3 us | 24.40 us | 22.82 us | 3,971.5 us | 3,950.2 us | 4,026.7 us | 1.00 | 0.00 |
Compress | C# | alice29.txt | 3,953.8 us | 33.50 us | 27.98 us | 3,943.8 us | 3,909.5 us | 4,015.5 us | 0.99 | 0.01 |
Compress | Interop | sum | 662.0 us | 11.95 us | 11.18 us | 658.1 us | 650.2 us | 684.0 us | 1.00 | 0.00 |
Compress | C# | sum | 639.7 us | 2.26 us | 1.77 us | 640.3 us | 637.1 us | 642.2 us | 0.96 | 0.02 |
Thanks, @brantburnett. Just a few percent in throughput isn't worth bringing this functionality down from System.IO.Hashing nor including a large amount of code in System.IO.Compression. Based on these numbers, I'd speculate that the zlib being used already does some amount of vectorization in its crc32 calculation, e.g. the zlib from Intel that .NET uses on Windows does: https://github.com/dotnet/runtime/blob/main/src/native/external/zlib-intel/crc_folding.c
Rather than interoping to zlib, we could use the Crc32 intrinsics when they're available on Arm and x86/x64:
https://github.com/dotnet/runtime/blob/995224db011f77eb095279122244704ccca01d5f/src/libraries/System.IO.Compression/src/System/IO/Compression/Crc32Helper.ZLib.cs#L11-L26
/cc @tannergooding