Closed: bjjones closed this issue 4 years ago
Looking at the patch, it replaces a managed CRC version with a native CRC version in zlib. Surely this alone is responsible for a portion of the performance delta.
Perhaps a better name for this proposal is "Use native zlib CRC32 instead of managed implementation."
I think there's some confusion here, in that I don't think the PR and this issue are directly related.
My understanding is that in this issue @bjjones is proposing ditching clrcompression.dll (the native zlib binary that System.IO.Compression P/Invokes to on Windows) and replacing it with a different zlib implementation, one optimized by Intel. His performance results are based on comparing that newer zlib with the current clrcompression.dll (which is also based on an older zlib source I believe).
Separate from that, his recent PR updates the crc32 calculation in the library to use the crc32 function from clrcompression.dll; that change is unrelated to this proposal, other than tangentially. Once updated to P/Invoke to native for the crc32 function, then any improvements that came from an improved native binary would show up as well.
If this issue is about replacing the code only on Windows, what implications does adding this Intel library have for the minimum system requirements?
The library seems to hard-compile in support for PCLMULQDQ rather than detecting it at runtime: https://github.com/jtkukunas/zlib/blob/e176b3c23ace88d5ded5b8f8371bbab6d7b02ba8/crc32.c#L443
This would mean it only works on Intel(R) Core processors from 2010 onwards and AMD processors from 2011 onwards: https://en.wikipedia.org/wiki/CLMUL_instruction_set
It would also seem to disqualify the entire Intel(R) Atom line, as all of them (even the server SoCs) appear to lack CLMUL support. (Please correct this if it's inaccurate.)
Thanks for the response @stephentoub , that sums things up well.
@OtherCrashOverride Those defines are only to determine which optimizations are compiled in, not which are used at runtime. CPUIDs are used at runtime to determine what code should be run. This should not break any past platforms.
Glad to hear that. Why is the changeset not upstream at http://www.zlib.net/?
I'm unable to find a substantial difference between zlib-intel and zlib 1.2.8 when using either as the base for System.IO.Compression.dll. My results are significantly less interesting than snellman's.
System: Windows 10, Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, 4 cores, 8 logical processors
Test:
public static IEnumerable<object[]> CanterburyCorpus()
{
    foreach (CompressionLevel compressionLevel in Enum.GetValues(typeof(CompressionLevel)))
    {
        foreach (int innerIterations in new int[] { 1, 10, 25, 50 })
        {
            yield return new object[] { innerIterations, "alice29.txt", compressionLevel };
            yield return new object[] { innerIterations, "asyoulik.txt", compressionLevel };
            yield return new object[] { innerIterations, "cp.html", compressionLevel };
            yield return new object[] { innerIterations, "fields.c", compressionLevel };
            yield return new object[] { innerIterations, "grammar.lsp", compressionLevel };
            yield return new object[] { innerIterations, "kennedy.xls", compressionLevel };
            yield return new object[] { innerIterations, "lcet10.txt", compressionLevel };
            yield return new object[] { innerIterations, "plrabn12.txt", compressionLevel };
            yield return new object[] { innerIterations, "ptt5", compressionLevel };
            yield return new object[] { innerIterations, "sum", compressionLevel };
            yield return new object[] { innerIterations, "xargs.1", compressionLevel };
        }
    }
}

/// <summary>
/// Benchmark tests to measure the performance of individually compressing each file in the
/// Canterbury Corpus.
/// </summary>
[Benchmark]
[MemberData("CanterburyCorpus")]
public void Compress_Canterbury(int innerIterations, string fileName, CompressionLevel compressLevel)
{
    byte[] bytes = File.ReadAllBytes(Path.Combine("GZTestData", "Canterbury", fileName));
    PerfUtils utils = new PerfUtils();
    FileStream[] filestreams = new FileStream[innerIterations];
    DeflateStream[] deflates = new DeflateStream[innerIterations];
    string[] paths = new string[innerIterations];

    foreach (var iteration in Benchmark.Iterations)
    {
        // Create the output files outside the measured region.
        for (int i = 0; i < innerIterations; i++)
        {
            paths[i] = utils.GetTestFilePath();
            filestreams[i] = File.Create(paths[i]);
        }

        // Measure only the compression itself.
        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < innerIterations; i++)
            {
                deflates[i] = new DeflateStream(filestreams[i], compressLevel);
                deflates[i].Write(bytes, 0, bytes.Length);
                deflates[i].Flush();
                deflates[i].Dispose();
                filestreams[i].Dispose();
            }
        }

        for (int i = 0; i < innerIterations; i++)
            File.Delete(paths[i]);
    }
}
The results for 1 compress/iteration using CompressionLevel.Optimal (values are median times over several hundred iterations):
innerIterations | file name | CompressLevel | zlib-adler | zlib-intel | intel / adler |
---|---|---|---|---|---|
1 | alice29.txt | Optimal | 11.29693 | 12.44476 | 110.16% |
1 | asyoulik.txt | Optimal | 10.63492 | 10.9474 | 102.94% |
1 | cp.html | Optimal | 1.102486 | 1.174903 | 106.57% |
1 | fields.c | Optimal | 0.618095 | 0.615815 | 99.63% |
1 | grammar.lsp | Optimal | 0.394859 | 0.39771 | 100.72% |
1 | kennedy.xls | Optimal | 41.16637 | 41.59916 | 101.05% |
1 | lcet10.txt | Optimal | 26.82167 | 27.84035 | 103.80% |
1 | plrabn12.txt | Optimal | 41.40671 | 41.33401 | 99.82% |
1 | ptt5 | Optimal | 14.94227 | 13.89251 | 92.97% |
1 | sum | Optimal | 9.999712 | 9.599426 | 96.00% |
1 | xargs.1 | Optimal | 0.417953 | 0.39771 | 95.16% |
For build consistency I used a zlib 1.2.8 built from master as the baseline instead of the clrcompression that we ship via nuget; zlib-intel was built from the win32_nmake branch.
Results for other innerIteration and CompressionLevels are very similar so I've left them out for brevity. I can upload them and the detailed per-iteration results if anyone is interested. On average we're near equality within a few percentage points:
innerIterations | CompressLevel | average intel/adler |
---|---|---|
1 | Optimal | 100.80% |
10 | Optimal | 95.16% |
25 | Optimal | 98.38% |
50 | Optimal | 99.92% |
1 | Fastest | 99.44% |
10 | Fastest | 100.08% |
25 | Fastest | 102.63% |
50 | Fastest | 96.55% |
1 | NoCompression | 98.12% |
10 | NoCompression | 86.19% |
25 | NoCompression | 103.42% |
50 | NoCompression | 101.86% |
@bjjones can you still see a significant improvement with zlib-intel over zlib-adler now that our codebase has been updated to use zlib's CRC32?
@ianhays
Those are interesting results. I've stayed current with the builds and I've seen the gains stay at +20-30% throughout, although I have been using the Calgary Corpus instead, as well as a set of images. If anything, using Zlib crc32 should increase the gains.
I'll take a look at your tests in the next couple days and see what I can reproduce.
Thanks @bjjones, I'm intrigued to hear your results; I was expecting a more substantial distinction between the zlibs and am wondering if the lack of such a distinction is perhaps due to an issue with my zlib dlls or the perf runner interfering with the runtime CPUID checks. If you get a chance, could you send me the zlib-intel dll you've been using for testing?
@ianhays I've created a comparable microbenchmark and saw noticeable speedup in all the scenarios posted. I tested on an Intel i5-4670.
innerIterations | filename | compressLevel | Adler | Intel | Intel / Adler |
---|---|---|---|---|---|
1 | alice29.txt | Optimal | 12.3288 | 9.0891 | 73.72% |
1 | asyoulik.txt | Optimal | 10.6926 | 7.3696 | 68.92% |
1 | cp.html | Optimal | 1.4408 | 1.2986 | 90.13% |
1 | fields.c | Optimal | 1.2047 | 1.0109 | 83.91% |
1 | grammar.lsp | Optimal | 0.8026 | 0.8032 | 100.07% |
1 | kennedy.xls | Optimal | 48.2053 | 18.928 | 39.27% |
1 | lcet10.txt | Optimal | 27.0928 | 18.4792 | 68.21% |
1 | plrabn12.txt | Optimal | 39.7202 | 26.7565 | 67.36% |
1 | ptt5 | Optimal | 14.5243 | 10.1791 | 70.08% |
1 | sum | Optimal | 10.024 | 3.767 | 37.58% |
1 | xargs.1 | Optimal | 0.9089 | 1.0233 | 112.59% |
innerIterations | CompressLevel | Average Intel / Adler |
---|---|---|
1 | Optimal | 73.8% |
10 | Optimal | 70.62% |
25 | Optimal | 72.33% |
50 | Optimal | 71.51% |
1 | Fastest | 84.47% |
10 | Fastest | 73.85% |
25 | Fastest | 71.97% |
50 | Fastest | 80.21% |
1 | NoCompression | 98.10% |
10 | NoCompression | 94.67% |
25 | NoCompression | 94.83% |
50 | NoCompression | 93.29% |
A copy of the clrcompression.dll I've been using is hosted here: https://www.dropbox.com/s/uybmpqfgr4svzpl/clrcompression-intel.dll?dl=0
Please let me know if there's anything I can do to help get you up and running with this. The build method you posted was exactly how I did it, but there may be differences in our runtime environments that are being overlooked.
Thanks for the help, @bjjones. There was an issue with the perf runner that was causing the local copy of the test folders to be ignored. I'm guessing it was pulling clrcompression.dll from the dnx folder like you suggested it might be; I wasn't aware the perf runner was based on DNX on Windows.
In any case, I modified my test to be a console app and am now noticing the significant improvements in zlib-intel.
Results with times in ticks:
innerIterations | File | Compression Level | Adler | Intel | Intel / Adler |
---|---|---|---|---|---|
25 | alice29.txt | Optimal | 1048320.00 | 733476 | 69.97% |
25 | asyoulik.txt | Optimal | 976800.00 | 671702 | 68.77% |
25 | cp.html | Optimal | 99704.00 | 87667 | 87.93% |
25 | fields.c | Optimal | 57052.00 | 52493 | 92.01% |
25 | grammar.lsp | Optimal | 41567.00 | 38504 | 92.63% |
25 | kennedy.xls | Optimal | 3823359.00 | 1735229 | 45.38% |
25 | lcet10.txt | Optimal | 2446221.00 | 1658126 | 67.78% |
25 | plrabn12.txt | Optimal | 3543454.00 | 2510901 | 70.86% |
25 | ptt5 | Optimal | 1277376.00 | 846599 | 66.28% |
25 | sum | Optimal | 963350.00 | 308337 | 32.01% |
25 | xargs.1 | Optimal | 46566.00 | 48470 | 104.09% |
innerIterations | CompressLevel | Average Intel / Adler |
---|---|---|
1 | Optimal | 75.69% |
10 | Optimal | 72.48% |
25 | Optimal | 72.52% |
1 | Fastest | 100.21% |
10 | Fastest | 87.13% |
25 | Fastest | 81.89% |
1 | NoCompression | 99.69% |
10 | NoCompression | 100.72% |
25 | NoCompression | 104.75% |
I'm looking forward to your PR to bring these improvements to clrcompression :)
Summary: Intel has developed an optimized version of zlib that includes architecture-specific optimizations and algorithmic changes to improve the deflate() and crc32() functions. Since System.IO.Compression calls out to these functions through clrcompression.dll, I suggest that zlib-intel be used to build clrcompression.dll.
Test Results: I compiled the library myself and, by renaming the .dll to clrcompression.dll and replacing the clrcompression included with .NET Core, I see an improvement of 20-30% when compressing with GZipStream on an Intel® 5th Generation Core™ i5 processor based system, as well as on an Intel® Xeon® Processor E5 v2 family based server system. The gain comes entirely from the optimized deflate() function.
Link The developer of Zlib-Intel has provided support for VS Intrinsics here: https://github.com/jtkukunas/zlib/tree/win32_nmake