dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Use Optimized Zlib-Intel to Build clrcompression.dll #15496

Closed: bjjones closed this issue 4 years ago

bjjones commented 9 years ago

**Summary**

Intel has developed an optimized version of zlib that includes architecture-specific optimizations and algorithmic changes to improve the included deflate() and crc32() functions. Since System.IO.Compression calls out to these functions through clrcompression.dll, I suggest that Zlib-Intel be used to build clrcompression.dll.

**Test Results**

I compiled the library myself and, by renaming the .dll to clrcompression.dll and replacing the clrcompression.dll included with .NET Core, I see an improvement of 20-30% when compressing with GZipStream on an Intel® 5th Generation Core™ i5 based system, as well as on an Intel® Xeon® Processor E5 v2 family based server system. The gain comes entirely from the optimized deflate() function.

**Link**

The developer of Zlib-Intel has provided support for Visual Studio intrinsics here: https://github.com/jtkukunas/zlib/tree/win32_nmake
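For anyone who wants to reproduce this kind of measurement, a minimal sketch is to time GZipStream directly (this is illustrative, not the exact benchmark used above; the zero-filled input buffer is a placeholder for a real corpus file):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class GZipTimer
{
    // Compress a buffer with GZipStream and return the output plus elapsed time.
    // Swapping clrcompression.dll changes the native deflate() this exercises.
    public static (byte[] Data, TimeSpan Elapsed) Compress(byte[] input)
    {
        var sw = Stopwatch.StartNew();
        using var ms = new MemoryStream();
        using (var gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
            gz.Write(input, 0, input.Length);
        sw.Stop();
        return (ms.ToArray(), sw.Elapsed);
    }

    static void Main()
    {
        byte[] input = new byte[1 << 20]; // placeholder: 1 MiB of zeros
        var (data, elapsed) = Compress(input);
        Console.WriteLine($"{input.Length} -> {data.Length} bytes in {elapsed.TotalMilliseconds:F2} ms");
    }
}
```

Run once against the stock clrcompression.dll and once against the renamed Intel build to get a relative comparison.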

OtherCrashOverride commented 9 years ago

> The gain comes entirely from the optimized deflate() function.

Looking at the patch, it replaces a managed CRC version with a native CRC version in zlib. Surely this alone is responsible for a portion of the performance delta.

Perhaps a better name for this proposal is "Use native zlib CRC32 instead of managed implementation."

stephentoub commented 9 years ago

> Looking at the patch

I think there's some confusion here, in that I don't think the PR and this issue are directly related.

My understanding is that in this issue @bjjones is proposing ditching clrcompression.dll (the native zlib binary that System.IO.Compression P/Invokes to on Windows) and replacing it with a different zlib implementation, one optimized by Intel. His performance results are based on comparing that newer zlib with the current clrcompression.dll (which is also based on an older zlib source I believe).

Separate from that, his recent PR updates the crc32 calculation in the library to use the crc32 function from clrcompression.dll; that change is unrelated to this proposal, other than tangentially. Once updated to P/Invoke to native for the crc32 function, then any improvements that came from an improved native binary would show up as well.
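For context on what "managed crc32" means here, a table-driven CRC-32 over zlib's reflected polynomial (0xEDB88320) looks like the sketch below. This is illustrative only, not corefx's actual removed code; the PR replaces this style of computation with a P/Invoke to the native crc32 export:

```csharp
using System;

static class ManagedCrc32
{
    // Table-driven CRC-32 using the polynomial zlib uses (0xEDB88320).
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    public static uint Compute(byte[] buffer)
    {
        uint crc = 0xFFFFFFFFu; // zlib's initial value, pre-inversion
        foreach (byte b in buffer)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu; // final inversion
    }
}
```

A native implementation can instead dispatch to PCLMULQDQ-accelerated code, which is where the per-CPU gains discussed below come from.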

OtherCrashOverride commented 9 years ago

> I think there's some confusion here, in that I don't think the PR and this issue are directly related.

If this issue is about only replacing the code on Windows versions, what is the implication to the minimum system requirement that adding this Intel library will have?

The library seems to hard-compile support for PCLMULQDQ rather than detect it at runtime. https://github.com/jtkukunas/zlib/blob/e176b3c23ace88d5ded5b8f8371bbab6d7b02ba8/crc32.c#L443

This means that it will only work on Intel(R) Core processors from 2010 onwards and AMD processors from 2011 onwards. https://en.wikipedia.org/wiki/CLMUL_instruction_set

OtherCrashOverride commented 9 years ago

This also seems to disqualify the entire Intel(R) Atom line, as they all (even the server SoCs) seem to lack support for CLMUL. (Please correct this if it's inaccurate.)

bjjones commented 9 years ago

Thanks for the response, @stephentoub; that sums things up well.

@OtherCrashOverride Those defines only determine which optimizations are compiled in, not which are used at runtime. CPUIDs are used at runtime to determine what code should be run. This should not break any past platforms.
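The same runtime checks zlib-intel performs via CPUID in C can be observed from managed code. Note this API (System.Runtime.Intrinsics.X86) postdates this thread, arriving in .NET Core 3.0, so it's offered only to illustrate the dispatch model:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class CpuFeatureCheck
{
    static void Main()
    {
        // These properties are resolved per-CPU at runtime, mirroring the
        // CPUID dispatch zlib-intel does in C: machines without the
        // instruction simply take the portable fallback code paths.
        Console.WriteLine($"SSE4.2 (crc32 instruction): {Sse42.IsSupported}");
        Console.WriteLine($"PCLMULQDQ (carry-less multiply): {Pclmulqdq.IsSupported}");
    }
}
```

On pre-2010 hardware both report false and no accelerated path is taken, which is why compiling the optimizations in does not raise the minimum system requirement.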

OtherCrashOverride commented 9 years ago

> CPUIDs are used at runtime to determine what code should be run. This should not break any past platforms.

Glad to hear that. Why has the changeset not been upstreamed at http://www.zlib.net/?

ianhays commented 8 years ago

I'm unable to find a substantial difference between zlib-intel and zlib 1.2.8 when using either as the base for System.IO.Compression.dll. My results are significantly less interesting than snellman's.

System: Windows 10 Processor Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 Mhz, 4 Core(s), 8 Logical Processor(s)

Test:

```csharp
public static IEnumerable<object[]> CanterburyCorpus()
{
    foreach (CompressionLevel compressionLevel in Enum.GetValues(typeof(CompressionLevel)))
    {
        foreach (int innerIterations in new int[] { 1, 10, 25, 50 })
        {
            yield return new object[] { innerIterations, "alice29.txt", compressionLevel };
            yield return new object[] { innerIterations, "asyoulik.txt", compressionLevel };
            yield return new object[] { innerIterations, "cp.html", compressionLevel };
            yield return new object[] { innerIterations, "fields.c", compressionLevel };
            yield return new object[] { innerIterations, "grammar.lsp", compressionLevel };
            yield return new object[] { innerIterations, "kennedy.xls", compressionLevel };
            yield return new object[] { innerIterations, "lcet10.txt", compressionLevel };
            yield return new object[] { innerIterations, "plrabn12.txt", compressionLevel };
            yield return new object[] { innerIterations, "ptt5", compressionLevel };
            yield return new object[] { innerIterations, "sum", compressionLevel };
            yield return new object[] { innerIterations, "xargs.1", compressionLevel };
        }
    }
}

/// <summary>
/// Benchmark tests to measure the performance of individually compressing each file in the
/// Canterbury Corpus
/// </summary>
[Benchmark]
[MemberData("CanterburyCorpus")]
public void Compress_Canterbury(int innerIterations, string fileName, CompressionLevel compressLevel)
{
    byte[] bytes = File.ReadAllBytes(Path.Combine("GZTestData", "Canterbury", fileName));
    PerfUtils utils = new PerfUtils();
    FileStream[] filestreams = new FileStream[innerIterations];
    DeflateStream[] deflates = new DeflateStream[innerIterations];
    string[] paths = new string[innerIterations];
    foreach (var iteration in Benchmark.Iterations)
    {
        for (int i = 0; i < innerIterations; i++)
        {
            paths[i] = utils.GetTestFilePath();
            filestreams[i] = File.Create(paths[i]);
        }
        using (iteration.StartMeasurement())
            for (int i = 0; i < innerIterations; i++)
            {
                deflates[i] = new DeflateStream(filestreams[i], compressLevel);
                deflates[i].Write(bytes, 0, bytes.Length);
                deflates[i].Flush();
                deflates[i].Dispose();
                filestreams[i].Dispose();
            }
        for (int i = 0; i < innerIterations; i++)
            File.Delete(paths[i]);
    }
}
```

The results for 1 compress/iteration using CompressionLevel.Optimal (values are median times over several hundred iterations):

| innerIterations | File name | CompressLevel | zlib-adler | zlib-intel | intel / adler |
|---|---|---|---|---|---|
| 1 | alice29.txt | Optimal | 11.29693 | 12.44476 | 110.16% |
| 1 | asyoulik.txt | Optimal | 10.63492 | 10.9474 | 102.94% |
| 1 | cp.html | Optimal | 1.102486 | 1.174903 | 106.57% |
| 1 | fields.c | Optimal | 0.618095 | 0.615815 | 99.63% |
| 1 | grammar.lsp | Optimal | 0.394859 | 0.39771 | 100.72% |
| 1 | kennedy.xls | Optimal | 41.16637 | 41.59916 | 101.05% |
| 1 | lcet10.txt | Optimal | 26.82167 | 27.84035 | 103.80% |
| 1 | plrabn12.txt | Optimal | 41.40671 | 41.33401 | 99.82% |
| 1 | ptt5 | Optimal | 14.94227 | 13.89251 | 92.97% |
| 1 | sum | Optimal | 9.999712 | 9.599426 | 96.00% |
| 1 | xargs.1 | Optimal | 0.417953 | 0.39771 | 95.16% |

Results for other innerIterations and CompressionLevel values are very similar, so I've left them out for brevity. I can upload them and the detailed per-iteration results if anyone is interested. On average we're near equality, within a few percentage points:

| innerIterations | CompressLevel | Average intel / adler |
|---|---|---|
| 1 | Optimal | 100.80% |
| 10 | Optimal | 95.16% |
| 25 | Optimal | 98.38% |
| 50 | Optimal | 99.92% |
| 1 | Fastest | 99.44% |
| 10 | Fastest | 100.08% |
| 25 | Fastest | 102.63% |
| 50 | Fastest | 96.55% |
| 1 | NoCompression | 98.12% |
| 10 | NoCompression | 86.19% |
| 25 | NoCompression | 103.42% |
| 50 | NoCompression | 101.86% |

@bjjones can you still see a significant improvement with zlib-intel over zlib-adler now that our codebase has been updated to use zlib's CRC32?

bjjones commented 8 years ago

@ianhays

Those are interesting results. I've stayed current with the builds and have seen the gains hold at 20-30% throughout, although I have been using the Calgary Corpus instead, as well as a set of images. If anything, using zlib's crc32 should increase the gains.

I'll take a look at your tests in the next couple days and see what I can reproduce.

ianhays commented 8 years ago

Thanks @bjjones, I'm intrigued to hear your results; I was expecting a more substantial distinction between the zlibs and am wondering whether the lack of one is due to an issue with my zlib DLLs or to the perf runner interfering with the runtime CPUID checks. If you get a chance, could you send me the zlib-intel dll you've been using for testing?

bjjones commented 8 years ago

@ianhays I've created a comparable microbenchmark and saw noticeable speedup in all the scenarios posted. I tested on an Intel i5-4670.

| innerIterations | Filename | CompressLevel | Adler | Intel | Intel / Adler |
|---|---|---|---|---|---|
| 1 | alice29.txt | Optimal | 12.3288 | 9.0891 | 73.72% |
| 1 | asyoulik.txt | Optimal | 10.6926 | 7.3696 | 68.92% |
| 1 | cp.html | Optimal | 1.4408 | 1.2986 | 90.13% |
| 1 | fields.c | Optimal | 1.2047 | 1.0109 | 83.91% |
| 1 | grammar.lsp | Optimal | 0.8026 | 0.8032 | 100.07% |
| 1 | kennedy.xls | Optimal | 48.2053 | 18.928 | 39.27% |
| 1 | lcet10.txt | Optimal | 27.0928 | 18.4792 | 68.21% |
| 1 | plrabn12.txt | Optimal | 39.7202 | 26.7565 | 67.36% |
| 1 | ptt5 | Optimal | 14.5243 | 10.1791 | 70.08% |
| 1 | sum | Optimal | 10.024 | 3.767 | 37.58% |
| 1 | xargs.1 | Optimal | 0.9089 | 1.0233 | 112.59% |

| innerIterations | CompressLevel | Average Intel / Adler |
|---|---|---|
| 1 | Optimal | 73.8% |
| 10 | Optimal | 70.62% |
| 25 | Optimal | 72.33% |
| 50 | Optimal | 71.51% |
| 1 | Fastest | 84.47% |
| 10 | Fastest | 73.85% |
| 25 | Fastest | 71.97% |
| 50 | Fastest | 80.21% |
| 1 | NoCompression | 98.10% |
| 10 | NoCompression | 94.67% |
| 25 | NoCompression | 94.83% |
| 50 | NoCompression | 93.29% |

A copy of the clrcompression.dll I've been using is hosted here: https://www.dropbox.com/s/uybmpqfgr4svzpl/clrcompression-intel.dll?dl=0

Please let me know if there's anything I can do to help get you up and running with this. The build method you posted was exactly how I did it, but there may be differences in our runtime environments that are being overlooked.

ianhays commented 8 years ago

Thanks for the help, @bjjones. There was an issue with the perf runner that was causing the local test folder's copy to be ignored. I'm guessing it was pulling clrcompression.dll from the dnx folder like you suggested it might be; I wasn't aware the perf runner was based on DNX on Windows.

In any case, I modified my test to be a console app and am now noticing the significant improvements in zlib-intel.
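A console-app harness along these lines (a sketch with a synthetic input buffer and made-up iteration counts, not the exact program used) sidesteps the perf runner entirely and reports the median like the tables above:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Linq;

class ConsoleBench
{
    // Median of a set of timing samples, as used in the result tables.
    public static double MedianTicks(long[] samples)
    {
        var s = samples.OrderBy(t => t).ToArray();
        int mid = s.Length / 2;
        return s.Length % 2 == 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
    }

    static void Main()
    {
        byte[] bytes = new byte[256 * 1024]; // stand-in for a corpus file
        new Random(42).NextBytes(bytes);
        const int iterations = 100, innerIterations = 25;
        var samples = new long[iterations];
        for (int it = 0; it < iterations; it++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < innerIterations; i++)
            {
                // Compress to memory; disposing the stream flushes the deflate output.
                using var ms = new MemoryStream();
                using var ds = new DeflateStream(ms, CompressionLevel.Optimal);
                ds.Write(bytes, 0, bytes.Length);
            }
            sw.Stop();
            samples[it] = sw.ElapsedTicks;
        }
        Console.WriteLine($"median ticks: {MedianTicks(samples)}");
    }
}
```

Running a standalone executable ensures the clrcompression.dll placed next to it is the one actually loaded.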

Results with times in ticks:

| innerIterations | File | Compression Level | Adler | Intel | Intel / Adler |
|---|---|---|---|---|---|
| 25 | alice29.txt | Optimal | 1048320.00 | 733476 | 69.97% |
| 25 | asyoulik.txt | Optimal | 976800.00 | 671702 | 68.77% |
| 25 | cp.html | Optimal | 99704.00 | 87667 | 87.93% |
| 25 | fields.c | Optimal | 57052.00 | 52493 | 92.01% |
| 25 | grammar.lsp | Optimal | 41567.00 | 38504 | 92.63% |
| 25 | kennedy.xls | Optimal | 3823359.00 | 1735229 | 45.38% |
| 25 | lcet10.txt | Optimal | 2446221.00 | 1658126 | 67.78% |
| 25 | plrabn12.txt | Optimal | 3543454.00 | 2510901 | 70.86% |
| 25 | ptt5 | Optimal | 1277376.00 | 846599 | 66.28% |
| 25 | sum | Optimal | 963350.00 | 308337 | 32.01% |
| 25 | xargs.1 | Optimal | 46566.00 | 48470 | 104.09% |

| innerIterations | CompressLevel | Average Intel / Adler |
|---|---|---|
| 1 | Optimal | 75.69% |
| 10 | Optimal | 72.48% |
| 25 | Optimal | 72.52% |
| 1 | Fastest | 100.21% |
| 10 | Fastest | 87.13% |
| 25 | Fastest | 81.89% |
| 1 | NoCompression | 99.69% |
| 10 | NoCompression | 100.72% |
| 25 | NoCompression | 104.75% |

I'm looking forward to your PR to bring these improvements to clrcompression :)