Closed: bjjones closed this issue 4 years ago
Looking at the patch, it replaces a managed CRC version with a native CRC version in zlib. Surely this alone is responsible for a portion of the performance delta.
Perhaps a better name for this proposal is "Use native zlib CRC32 instead of managed implementation."
I think there's some confusion here, in that I don't think the PR and this issue are directly related.
My understanding is that in this issue @bjjones is proposing ditching clrcompression.dll (the native zlib binary that System.IO.Compression P/Invokes to on Windows) and replacing it with a different zlib implementation, one optimized by Intel. His performance results are based on comparing that newer zlib with the current clrcompression.dll (which is also based on an older zlib source I believe).
Separate from that, his recent PR updates the crc32 calculation in the library to use the crc32 function from clrcompression.dll; that change is unrelated to this proposal, other than tangentially. Once updated to P/Invoke to native for the crc32 function, then any improvements that came from an improved native binary would show up as well.
If this issue is about replacing the code only on Windows, what implications does adding this Intel library have for the minimum system requirements?
The library seems to hard-compile in support for PCLMULQDQ rather than detecting it at runtime: https://github.com/jtkukunas/zlib/blob/e176b3c23ace88d5ded5b8f8371bbab6d7b02ba8/crc32.c#L443
This would mean it only works on Intel(R) Core processors from 2010 onwards and AMD processors from 2011 onwards: https://en.wikipedia.org/wiki/CLMUL_instruction_set
It would also seem to disqualify the entire Intel(R) Atom line, as all of them (even the server SoCs) appear to lack CLMUL support. (Please correct this if it's inaccurate.)
Thanks for the response @stephentoub , that sums things up well.
@OtherCrashOverride Those defines are only to determine which optimizations are compiled in, not which are used at runtime. CPUIDs are used at runtime to determine what code should be run. This should not break any past platforms.
Glad to hear that. Why is the changeset not upstream at http://www.zlib.net/?
I'm unable to find a substantial difference between zlib-intel and zlib 1.2.8 when using either as the base for System.IO.Compression.dll. My results are significantly less interesting than snellman's.
System: Windows 10, Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, 4 cores, 8 logical processors
Test:
public static IEnumerable<object[]> CanterburyCorpus()
{
    foreach (CompressionLevel compressionLevel in Enum.GetValues(typeof(CompressionLevel)))
    {
        foreach (int innerIterations in new int[] { 1, 10, 25, 50 })
        {
            yield return new object[] { innerIterations, "alice29.txt", compressionLevel };
            yield return new object[] { innerIterations, "asyoulik.txt", compressionLevel };
            yield return new object[] { innerIterations, "cp.html", compressionLevel };
            yield return new object[] { innerIterations, "fields.c", compressionLevel };
            yield return new object[] { innerIterations, "grammar.lsp", compressionLevel };
            yield return new object[] { innerIterations, "kennedy.xls", compressionLevel };
            yield return new object[] { innerIterations, "lcet10.txt", compressionLevel };
            yield return new object[] { innerIterations, "plrabn12.txt", compressionLevel };
            yield return new object[] { innerIterations, "ptt5", compressionLevel };
            yield return new object[] { innerIterations, "sum", compressionLevel };
            yield return new object[] { innerIterations, "xargs.1", compressionLevel };
        }
    }
}

/// <summary>
/// Benchmark tests to measure the performance of individually compressing each file in the
/// Canterbury Corpus.
/// </summary>
[Benchmark]
[MemberData("CanterburyCorpus")]
public void Compress_Canterbury(int innerIterations, string fileName, CompressionLevel compressLevel)
{
    byte[] bytes = File.ReadAllBytes(Path.Combine("GZTestData", "Canterbury", fileName));
    PerfUtils utils = new PerfUtils();
    FileStream[] filestreams = new FileStream[innerIterations];
    DeflateStream[] deflates = new DeflateStream[innerIterations];
    string[] paths = new string[innerIterations];

    foreach (var iteration in Benchmark.Iterations)
    {
        // Create the output files outside the measured region.
        for (int i = 0; i < innerIterations; i++)
        {
            paths[i] = utils.GetTestFilePath();
            filestreams[i] = File.Create(paths[i]);
        }

        // Measure only the compression itself.
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < innerIterations; i++)
            {
                deflates[i] = new DeflateStream(filestreams[i], compressLevel);
                deflates[i].Write(bytes, 0, bytes.Length);
                deflates[i].Flush();
                deflates[i].Dispose();
                filestreams[i].Dispose();
            }
        }

        for (int i = 0; i < innerIterations; i++)
            File.Delete(paths[i]);
    }
}
The results for 1 compress/iteration using CompressionLevel.Optimal (values are median times over several hundred iterations):
innerIterations | file name | CompressLevel | zlib-adler | zlib-intel | intel / adler |
---|---|---|---|---|---|
1 | alice29.txt | Optimal | 11.29693 | 12.44476 | 110.16% |
1 | asyoulik.txt | Optimal | 10.63492 | 10.9474 | 102.94% |
1 | cp.html | Optimal | 1.102486 | 1.174903 | 106.57% |
1 | fields.c | Optimal | 0.618095 | 0.615815 | 99.63% |
1 | grammar.lsp | Optimal | 0.394859 | 0.39771 | 100.72% |
1 | kennedy.xls | Optimal | 41.16637 | 41.59916 | 101.05% |
1 | lcet10.txt | Optimal | 26.82167 | 27.84035 | 103.80% |
1 | plrabn12.txt | Optimal | 41.40671 | 41.33401 | 99.82% |
1 | ptt5 | Optimal | 14.94227 | 13.89251 | 92.97% |
1 | sum | Optimal | 9.999712 | 9.599426 | 96.00% |
1 | xargs.1 | Optimal | 0.417953 | 0.39771 | 95.16% |
For build consistency I used a zlib 1.2.8 built from master as the baseline instead of the clrcompression that we ship via nuget; zlib-intel was built from the win32_nmake branch.
Results for other innerIteration and CompressionLevels are very similar so I've left them out for brevity. I can upload them and the detailed per-iteration results if anyone is interested. On average we're near equality within a few percentage points:
innerIterations | CompressLevel | average intel/adler |
---|---|---|
1 | Optimal | 100.80% |
10 | Optimal | 95.16% |
25 | Optimal | 98.38% |
50 | Optimal | 99.92% |
1 | Fastest | 99.44% |
10 | Fastest | 100.08% |
25 | Fastest | 102.63% |
50 | Fastest | 96.55% |
1 | NoCompression | 98.12% |
10 | NoCompression | 86.19% |
25 | NoCompression | 103.42% |
50 | NoCompression | 101.86% |
@bjjones can you still see a significant improvement with zlib-intel over zlib-adler now that our codebase has been updated to use zlib's CRC32?
@ianhays
Those are interesting results. I've stayed current with the builds and I've seen the gains stay at +20-30% throughout, although I have been using the Calgary Corpus instead, as well as a set of images. If anything, using Zlib crc32 should increase the gains.
I'll take a look at your tests in the next couple days and see what I can reproduce.
Thanks @bjjones, I'm intrigued to hear your results; I was expecting a more substantial distinction between the zlibs and am wondering if the lack of such a distinction is perhaps due to an issue with my zlib dlls or the perf runner interfering with the runtime CPUID checks. If you get a chance, could you send me the zlib-intel dll you've been using for testing?
@ianhays I've created a comparable microbenchmark and saw noticeable speedup in all the scenarios posted. I tested on an Intel i5-4670.
innerIterations | filename | compressLevel | Adler | Intel | Intel / Adler |
---|---|---|---|---|---|
1 | alice29.txt | Optimal | 12.3288 | 9.0891 | 73.72% |
1 | asyoulik.txt | Optimal | 10.6926 | 7.3696 | 68.92% |
1 | cp.html | Optimal | 1.4408 | 1.2986 | 90.13% |
1 | fields.c | Optimal | 1.2047 | 1.0109 | 83.91% |
1 | grammar.lsp | Optimal | 0.8026 | 0.8032 | 100.07% |
1 | kennedy.xls | Optimal | 48.2053 | 18.928 | 39.27% |
1 | lcet10.txt | Optimal | 27.0928 | 18.4792 | 68.21% |
1 | plrabn12.txt | Optimal | 39.7202 | 26.7565 | 67.36% |
1 | ptt5 | Optimal | 14.5243 | 10.1791 | 70.08% |
1 | sum | Optimal | 10.024 | 3.767 | 37.58% |
1 | xargs.1 | Optimal | 0.9089 | 1.0233 | 112.59% |
innerIterations | CompressLevel | Average Intel / Adler |
---|---|---|
1 | Optimal | 73.8% |
10 | Optimal | 70.62% |
25 | Optimal | 72.33% |
50 | Optimal | 71.51% |
1 | Fastest | 84.47% |
10 | Fastest | 73.85% |
25 | Fastest | 71.97% |
50 | Fastest | 80.21% |
1 | NoCompression | 98.10% |
10 | NoCompression | 94.67% |
25 | NoCompression | 94.83% |
50 | NoCompression | 93.29% |
A copy of the clrcompression.dll I've been using is hosted here: https://www.dropbox.com/s/uybmpqfgr4svzpl/clrcompression-intel.dll?dl=0
Please let me know if there's anything I can do to help get you up and running with this. The build method you posted was exactly how I did it, but there may be differences in our runtime environments that are being overlooked.
Thanks for the help, @bjjones. There was an issue with the perf runner that was causing the local copy of the test folders to be ignored. I'm guessing it was pulling clrcompression.dll from the dnx folder like you suggested it might be; I wasn't aware the perf runner was based on DNX on Windows.
In any case, I modified my test to be a console app and am now noticing the significant improvements in zlib-intel.
Results with times in ticks:
innerIterations | File | Compression Level | Adler | Intel | Intel / Adler |
---|---|---|---|---|---|
25 | alice29.txt | Optimal | 1048320.00 | 733476 | 69.97% |
25 | asyoulik.txt | Optimal | 976800.00 | 671702 | 68.77% |
25 | cp.html | Optimal | 99704.00 | 87667 | 87.93% |
25 | fields.c | Optimal | 57052.00 | 52493 | 92.01% |
25 | grammar.lsp | Optimal | 41567.00 | 38504 | 92.63% |
25 | kennedy.xls | Optimal | 3823359.00 | 1735229 | 45.38% |
25 | lcet10.txt | Optimal | 2446221.00 | 1658126 | 67.78% |
25 | plrabn12.txt | Optimal | 3543454.00 | 2510901 | 70.86% |
25 | ptt5 | Optimal | 1277376.00 | 846599 | 66.28% |
25 | sum | Optimal | 963350.00 | 308337 | 32.01% |
25 | xargs.1 | Optimal | 46566.00 | 48470 | 104.09% |
innerIterations | CompressLevel | Average Intel / Adler |
---|---|---|
1 | Optimal | 75.69% |
10 | Optimal | 72.48% |
25 | Optimal | 72.52% |
1 | Fastest | 100.21% |
10 | Fastest | 87.13% |
25 | Fastest | 81.89% |
1 | NoCompression | 99.69% |
10 | NoCompression | 100.72% |
25 | NoCompression | 104.75% |
I'm looking forward to your PR to bring these improvements to clrcompression :)
Summary: Intel has developed an optimized version of zlib that includes architecture-specific optimizations and algorithmic changes to improve the deflate() and crc32() functions. Since System.IO.Compression calls out to these functions through clrcompression.dll, I suggest that zlib-intel be used to build clrcompression.dll.
Test Results: I compiled the library myself and, by renaming the .dll to clrcompression.dll and replacing the clrcompression included with .NET Core, I see an improvement of 20-30% when compressing with GZipStream on an Intel® 5th Generation Core™ i5 processor based system, as well as on an Intel® Xeon® Processor E5 v2 family based server system. The gain comes entirely from the optimized deflate() function.
Link The developer of Zlib-Intel has provided support for VS Intrinsics here: https://github.com/jtkukunas/zlib/tree/win32_nmake