DeflateStream output differs between .NET Framework and .NET Core

tmat commented 5 years ago

The compression result differs between .NET Framework and .NET Core. Is it expected? Are there any knobs that can be used to remove the differences?

The following program prints different output:

var text = "\r\nusing System;\r\n\r\nclass C\r\n{\r\n    public static void Main()\r\n    {\r\n        Console.WriteLine();\r\n    }\r\n}\r\n";
var compressedStream = new MemoryStream();

using (var deflater = new DeflateStream(compressedStream, CompressionLevel.Optimal, leaveOpen: true))
using (var writer = new StreamWriter(deflater, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
{
    writer.Write(text);
}

compressedStream.Position = 0;
Console.WriteLine(BitConverter.ToString(compressedStream.ToArray()));

Full Framework:

7B-BF-7B-FF-FB-DD-FB-79-B9-4A-8B-33-F3-D2-15-82-2B-8B-4B-52-73-AD-79-B9-78-B9-92-73-12-
8B-8B-15-9C-79-B9-AA-79-B9-14-80-A0-A0-34-29-27-33-59-A1-B8-24-B1-04-48-95-E5-67-A6-28-
F8-26-66-E6-69-68-42-A4-A1-AA-40-C0-39-3F-AF-38-3F-27-55-2F-BC-28-B3-24-D5-27-33-2F-55-
43-D3-1A-22-5B-CB-CB-05-44-00

Core:

7A-BF-7B-FF-FB-DD-FB-79-B9-4A-8B-33-F3-D2-15-82-2B-8B-4B-52-73-AD-79-B9-78-B9-92-73-12-
8B-8B-15-9C-79-B9-AA-79-B9-14-80-A0-A0-34-29-27-33-59-A1-B8-24-B1-04-48-95-E5-67-A6-28-
F8-26-66-E6-69-68-42-A4-A1-AA-40-C0-39-3F-AF-38-3F-27-55-2F-BC-28-B3-24-D5-27-33-2F-55-
43-13-68-22-48-A6-96-97-0B-88-00-00-00-00-FF-FF-03-00

Both blobs decompress back to the same value. But their compressed lengths are different.

stephentoub commented 5 years ago

Is it expected?

Yes. The implementations are almost entirely different. .NET Framework had a managed implementation, .NET Core uses zlib, and any changes/improvements in the underlying algorithms/settings can influence the compression ratios/outputs/etc.

Are there any knobs that can be used to remove the differences?

No.

Can you elaborate on the concern?

tmat commented 5 years ago

Also it appears there is a difference between Windows Core CLR and Linux Core CLR :(.

The concern is that the compiler output is different for the same inputs.

stephentoub commented 5 years ago

Also it appears there is a difference between Windows Core CLR and Linux Core CLR :(.

Quite possibly. On Linux and macOS we use whatever zlib is installed on the box. If it changes its compression from version to version, that will show up in use of DeflateStream.

tmat commented 5 years ago

Now this is more of a concern. You're basically saying that the behavior of the compression algorithm is non-deterministic/unspecified? I'd expect it to produce the same output based on the same input parameters.

tmat commented 5 years ago

(regardless of the implementation - different implementations might be faster or slower but they should produce the same output).

stephentoub commented 5 years ago

You're basically saying that the behavior of the compression algorithm is non-deterministic/unspecified?

No, it should be deterministic, on the same OS on the same version. If it's not we'd want to understand why.

but they should produce the same output

That's generally not how compression libraries work. There are no guarantee of exact same output across bug fixes, improvements to throughput, improvements to compression ratios, etc. What you're suggesting would be that a library would be prohibited from doing what it does better from version to version.

tmat commented 5 years ago

I'd expect it to be controlled by a parameter.

No, it should be deterministic, on the same OS on the same version.

Now that seems problematic. Consider a cloud build that runs on hundred machines. If some of the machines have been updated to a newer version and others have not then the result of the build will depend on where a particular dll is built. That is not good.

stephentoub commented 5 years ago

I'd expect it to be controlled by a parameter.

So every time Roslyn improves the IL it generates for the same input, it exposes a knob so that it's opt-in for every individual IL change? I don't think so.

tmat commented 5 years ago

That's different since I would expect the whole build to use the same compiler. I guess now we need to consider the version of zlib as well. Do we redistribute zlib with Core CLR or we use the one that's installed on the OS?

stephentoub commented 5 years ago

Do we redistribute zlib with Core CLR or we use the one that's installed on the OS?

On Linux and macOS where zlib is available everywhere, we use the one installed on the OS. On Windows where it's not, we have to distribute it with .NET Core, in the clrcompression.dll, and since we're distributing it anyway, we then use Intel's optimized version of zlib.

tmat commented 5 years ago

Hmm, can we always redist the zlib with Core CLR, so that at least with the same version of runtime we get the same outputs?

stephentoub commented 5 years ago

We try to redist as little as possible, a) so that we're not on the hook for servicing (especially security issues, zero-day vulnerabilities, etc.), b) to keep distribution sizes as small as possible, c) to allow admins to control the versions of such libraries the way they do elsewhere (lots of apps and frameworks use zlib), etc.

tmat commented 5 years ago

What that means though is that deterministic cloud build must run the same version of OS on all machines (e.g. in a container), using the same version of Core CLR is not enough.

FYI @jaredpar

danmoseley commented 5 years ago

If the build has non managed compression steps such as tar it could have the same issue, it is not specific to.NET.

jaredpar commented 5 years ago

Sounds like the build is still deterministic here. That is given the same inputs the compilation will produce the same outputs (bit for bit). The interesting bit here is that there are more inputs than we expected.

The set of inputs is most important when we're calculating a key for compilation in a caching environment. The idea being that we could re-use the output of a compilation on one machine on a separate machine assuming all of the inputs are the same. Sounds like OS version, or maybe zlib version, needs to be one of the keys.

That's unfortunate but not a big deviation from today as the operating system itself is already a part of the key. Different operating systems have different line endings and depending on how your gitattributes are setup this can change the line endings of source files. Line endings affect the end output of the build and hence if they're potentially different they need to generate different keys.

Another way to think about this: a git SHA is insufficient for uniquely identifying the source code of a product. The operating system needs to be factored in as well.

Given that the operating system already needs to be a part of any key we generate adding the version doesn't seem like a deal breaker. It's unfortunate but not too far removed from where we are today.

tmat commented 5 years ago

The interesting bit here is that there are more inputs than we expected.

I think that's a problem though. The more state we need to include in the key the less opportunity for reuse of build artifacts there is and the less the deterministic property of the build is useful.

In any case, it would be good if we could get away with only including zlib version but not OS version. Sounds like we need an API from CoreFX that exposes the zlib library name and its version then.

odalet commented 5 years ago

I come here from https://github.com/SixLabors/ImageSharp/issues/1027

I don't really bother .NET runtimes outputting different sizes due to different implementations, but these sizes should be in the same range. When running the same test app compiled with .NET Core 2 vs .NET 4.7.2 and using CompressionLevel.Optimal, the same PNG image has its size nearly tripled: I have a 102 KB image when running the net472 version vs 289 KB when running the .NET Core built one...

Is this really the intended behavior?

odalet commented 5 years ago

By using this (crappy) test code, I end up having a file 2x as large when run with .NET Core (2.2 or 3) compared to .NET 4.7.2 on Windows:

tests.zip

using System.IO;
using System.IO.Compression;

namespace ZLibTest
{
    internal sealed class Program
    {
        private static void Main(string[] args)
        {
            var bytes = GetImageBytes(350, 350);
            using (var fstream = File.Create(@"c:\temp\test.bin"))
            using (var stream = new DeflateStream(fstream, CompressionLevel.Optimal))
                stream.Write(bytes, 0, bytes.Length);
        }

        private static byte[] GetImageBytes(int width, int height)
        {
            var bytes = new byte[width * height * 4];
            for (var y = 0; y < height; y++)
            {
                for (var x = 0; x < width * 4; x += 4)
                {
                    bytes[4 * y * width + x] = (byte)((x + y) % 256); // R
                    bytes[4 * y * width + x + 1] = 0; // G
                    bytes[4 * y * width + x + 2] = 0; // B
                    bytes[4 * y * width + x + 3] = 255; // A
                }
            }

            return bytes;
        }
    }
}

antonfirsov commented 5 years ago

I hope @odalet's finding is a game changer for this issue.

I know there is no easy solution, but from BCL consumer's perspective DeflateStream looks like a very leaky abstraction in it's current form. To fix SixLabors/ImageSharp#1027 we basically need to reimplement the class.

stephentoub commented 5 years ago

cc: @ericstj

antonfirsov commented 4 years ago

After reading the whole discussion, it seems like a new issue shall be opened for https://github.com/dotnet/corefx/issues/40170#issuecomment-542144997_, since the problem scope is quite different.

In general, difference between output blobs is OK, since binary compatibility can't be expected here. EDIT: Might worth to warn users in docs though, to avoid misunderstandings in future.
But 2x difference in output size is a compression performance problem on the specific platform <= probably worth a chance to have it's own issue, better visibility, more feedback.

ericstj commented 4 years ago

Can it be demonstrated in a real world payload? We had a similar case here; https://github.com/dotnet/corefx/issues/34157 where we significantly regressed the compression on sparse files. The test case used by @odalet is not all that dissimilar from the smile24.bmp case in that issue. Although you observe 2x difference in compressed size, the compression savings is only slightly different 99.36% vs 98.5%, still significant but not quite as serious as 2x when you account for original size. @jtkukunas had a good comment about the trade-offs here: https://github.com/dotnet/corefx/issues/34157#issuecomment-481817942

I think the interesting case that's still not performing well is a repeating sequence.

antonfirsov commented 4 years ago

@ericstj for real world case, see https://github.com/dotnet/corefx/issues/40170#issuecomment-542130479, and the PNG image in the referenced ImageSharp issue.

Note: that issue has been closed because of an expensive library-level workaround (porting and optimizing SharpZipLib code): SixLabors/ImageSharp#1054

carlossanlop commented 4 years ago

Triage: Closing this original issue since it's by design: We don't guarantee bit for bit compression results between releases. @odalet, can you please open a new issue for the problem you described above?

ericstj commented 4 years ago

@antonfirsov I was thinking something more real world than a contrived pattern image. The tuning of the compression algorithm is a set of trade-offs it doesn't make sense to add special cases for contrived data at the detriment to real world data.

@odalet, You may find that https://github.com/dotnet/corefx/issues/36348 can help here in the cases folks demand better size.

Right now Optimal is giving better performance (faster compression) and similar compression on most real-world samples and compression test corpus data as we found in https://github.com/dotnet/corefx/issues/34157. If you still feel like opening an issue around repeated patterns on optimal we can try to have the zlib-intel owners investigate if the algorithm can be improved for these test cases.

JimBobSquarePants commented 4 years ago

Can we not dismiss real world use cases so flippantly please it comes across as rude. Repeated pattern graphics are commonly used in sprite based gaming, web design etc.

karelz commented 4 years ago

@JimBobSquarePants we are not trying to dismiss real world cases - I don't see your raised here. Maybe it is hidden in one of the previous links? Can you please elaborate on it? We are definitely not trying to be rude as you know, but we can't comment on things which are not here or are not clear. Also, we are only human beings and mistakes/omissions/mishaps can happen.

What would help is to clearly state: Why is binary output compat important for your scenario between .NET Core and .NET Framework?

antonfirsov commented 4 years ago

@karelz putting up ImageSharp 🎩: for us, it's not about binary compat, see: https://github.com/dotnet/corefx/issues/40170#issuecomment-566811738

(The discussion here is super misleading, I'll probably open a new issue.)

JimBobSquarePants commented 4 years ago

@karelz I was directly referencing this comment (emphasis mine).

@antonfirsov I was thinking something more real world than a contrived pattern image. The tuning of the compression algorithm is a set of trade-offs it doesn't make sense to add special cases for contrived data at the detriment to real world data.

The example provided by @odalet is a real world example. The difference in the compression result make a huge difference to expected output across platforms (massaging the number by comparing compression percentages does not help here) and affects performance across the wire.

karelz commented 4 years ago

Ok, I had a prepared answer and THEN I found out finally some bits and pieces that start to make sense and I vaguely understand what the misunderstanding here is about. I now totally agree with @antonfirsov that this issue is totally misleading.

The original problem is about output difference bit by bit and we closed it on that merit - see https://github.com/dotnet/corefx/issues/40170#issuecomment-567186454

We even asked to open a new issue for the 2x size in our answer, confirming @antonfirsov's recommendation: https://github.com/dotnet/corefx/issues/40170#issuecomment-566811738 ... which is BTW something I missed during triage and only realized now there are 2 independent problems interwined here.

So, stepping back: Let's pull the 2x problem into separate problem and discuss that. It should not have been mixed up with the original report in the first place to avoid this kind of confusion. Is that good enough resolution?

odalet commented 4 years ago

@ To all: I just read through all these comments, and I'll try to be more clear on why I commented here:

My primary issue is indeed with too big a difference in compression levels when using .NET Core on Windows vs .NET fx or .NET core on other platforms. I'm indeed not interested in having the exact same bits.

That's on me for not having identified my issue is different from the original one here.

Concerning the "real world" aspect of this: my example code was not meant to be "real world". It may be for some usages, but in fact not really for mine. I only supposed it may be useful to quickly exhibit the size differences in output.

I came up with another repro that better models what kind of images I'm dealing with. Basically I generate a 5000x5000x32b black image with a 1-pixel wide reddish square on it. This is still not a real-world image, but it is a better simulation, than my previous example. Here is the generation part:

private static byte[] GetImageBytes2()
{
    const int width = 5000;
    const int height = 5000;

    var random = new Random(1);

    // Returns a random temperature value between 64 and 192
    byte temperature() => (byte)(256 * (random.NextDouble() / 2.0 + 0.25));

    byte red(int x, int y) => x == 500 || x == 4500 || y == 500 || y == 4500 ? temperature() : (byte)0;

    var bytes = new byte[width * height * 4];
    for (var y = 0; y < height; y++)
    {
        for (var x = 0; x < width * 4; x += 4)
        {
            bytes[4 * y * width + x] = red(x, y); // R
            bytes[4 * y * width + x + 1] = 0; // G
            bytes[4 * y * width + x + 2] = 0; // B
            bytes[4 * y * width + x + 3] = 255; // A
        }
    }

    return bytes;
}

Sad news, the difference in size is even more obvious:

523 KB for the .NET Core 3/Windows build,
151 KB for the .NET 4.7.2 build And yes it is obviously very very sparse (hence related to dotnet/runtime#28235) .

@antonfirsov I suppose the first one whos has time/motivation enough to create the new issue will CC the other one so that he can add his own comments ;)

antonfirsov commented 4 years ago

Let's pull the 2x problem into separate problem and discuss that.

@karelz I'd like to take care of this (have a plan).

karelz commented 4 years ago

@antonfirsov sure thing - do you mean opening issue, or even working on a fix? Whomever opens the issue first, please link it from here. Thank you!

antonfirsov commented 4 years ago

@karelz issue first, to gather some feedback.

hkuno9000 commented 4 years ago

.NET Core 3.1 DeflateStream+Flush() ouput differs to ZipArchive output.

@tmat I found that: .NET Core's DeflateStream.Flush() appends "00 00 FF FF 03 00" at tail and change the head byte. .NET Core's DeflateStream.Close() dose not. This is similar your case.

This is seemed that "00 00 FF FF" are made by ZFlushCode.SyncFlush mode in the class System.IO.Compression.Deflater.

MS Test class

using Microsoft.VisualStudio.TestTools.UnitTesting;
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

namespace MSTests
{
    [TestClass()]
    public class MyTests
    {
        internal const string test_pattern1 = "hello, zip file world";
        internal const string test_pattern2 = "blank-----------world";

       [TestMethod, TestCategory("try")]
        [DataRow(test_pattern1)]
        [DataRow(test_pattern2)]
        public void DeflateStreamTest(string pattern)
        {
            // DeflateaStream vs ZipArchive
            byte[] data = System.Text.Encoding.UTF8.GetBytes(pattern);
            Console.WriteLine($"orignal data({data.Length} bytes): string '{pattern}'");

            // compress by DeflateStream
            using (var memory = new MemoryStream())
            using (var compressor = new DeflateStream(memory, CompressionLevel.Optimal, true))
            {
                compressor.Write(data, 0, data.Length);
#if USE_FLUSH
                Console.WriteLine("compressor.Flush()");
                compressor.Flush(); // output 00-00-FF-FF-03-00 at data tail.
#endif
                compressor.Close();
                byte[] compressed = memory.ToArray();
                Console.WriteLine($"DeflateStream comporessed({compressed.Length} bytes):\n dump {BitConverter.ToString(compressed)}");
            }

            // compress by ZipArchive
            using (var memory = new MemoryStream())
            {
                using (var archive = new ZipArchive(memory, ZipArchiveMode.Create, leaveOpen: true))
                {
                    var entry = archive.CreateEntry("AAA");
                    using (var writer = new BinaryWriter(entry.Open()))
                    {
                        writer.Write(data);
                    }
                }
                using (var archive = new ZipArchive(memory, ZipArchiveMode.Read, leaveOpen: true))
                {
                    var entry = archive.GetEntry("AAA");
                    var zipfile = memory.ToArray();
                    int i;
                    for (i = 0; i < zipfile.Length; ++i)
                    {
                        // find "AAA" position
                        if (zipfile[i] == 'A' && zipfile[i + 1] == 'A' && zipfile[i + 2] == 'A') break;
                    }
                    var rawData = zipfile.Skip(i + 3) // rawData is stored after entryname "AAA"
                        .Take((int)entry.CompressedLength)
                        .ToArray();
                    Console.WriteLine($"ZipArchive comporessed({entry.CompressedLength} bytes):\n dump {BitConverter.ToString(rawData)}");
                }
            }
        }
     }
}

Result USE_FLUSH is true

result is differ.

head byte: 4A to 4B
tail bytes: 00 00 00 FF FF 03 00 to -

orignal data(21 bytes): string 'blank-----------world'
compressor.Flush()
DeflateStream comporessed(21 bytes):
 dump 4A-CA-49-CC-CB-D6-45-80-F2-FC-A2-9C-14-00-00-00-00-FF-FF-03-00
ZipArchive comporessed(14 bytes):
 dump 4B-CA-49-CC-CB-D6-45-80-F2-FC-A2-9C-14-00

Result USE_FLUSH is false

result is same.

orignal data(21 bytes): string 'blank-----------world'
DeflateStream comporessed(14 bytes):
 dump 4B-CA-49-CC-CB-D6-45-80-F2-FC-A2-9C-14-00
ZipArchive comporessed(14 bytes):
 dump 4B-CA-49-CC-CB-D6-45-80-F2-FC-A2-9C-14-00

dotnet / runtime