bp74 / Zstandard.Net

A Zstandard wrapper for .Net

Compressed data incompatible between different implementations of zstandard (Python, C# Simple) Diff. Length or header? #17

Closed Grantelbob closed 5 years ago

Grantelbob commented 5 years ago

Hello,

I found that the compressed output of this library differs from that of other Zstandard implementations. My goal is to stream-compress a large JSON string and decompress it on the Python side.

Libraries used (all featured at https://facebook.github.io/zstd/):
1: https://github.com/bp74/Zstandard.Net/ (this one)
2: https://github.com/skbkontur/ZstdNet (simple version, no Stream support; I swapped zstd.dll to v1.3.8)
3: https://pypi.org/project/zstandard/ (Python Full Version)

Information:

For convenience and readability I converted the byte array produced by compression to Base64. The problem seems to appear only with this library; the simple version and the Python implementation both produce "correct" output. For testing I used compression level 11.

Problem: It is NOT possible to compress a string using this library with the code shown below and then decompress it on the Python side. Error: ZstdError('could not determine content size in frame header')

The same methods work without any problem with another algorithm such as GZIP. With GZIP as the compressor (same source code, only the compressor swapped), decompression on the Python side works. So I assume the error is not in how I use stream compression.

There is also no problem when decompressing the result with the same library. It fails only when mixing implementations.

In my case I have a very large class object that is serialized to JSON and compressed immediately using stream methods (simple approaches just consume too much memory).

Problem, compressed results:

SourceString = "This is a unbelievable long string..." (I understand that this string has almost no compression potential; the same problem occurs with longer strings too)

Compressed output using:
1: (this library) KLUv/QBgKAEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLgEAAA==
2: (Simple Version) KLUv/SAlKQEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLg==
3: (Python Full Version) KLUv/SAlKQEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLg==

As can be seen, there is a clear difference at the START and END of the data. I guess this is where the byte length etc. is written?
KLUv/QBgKAEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLgEAAA==
KLUv/SAlKQEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLg==
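
The differing bytes at the start and end can be checked against the published Zstandard frame format (RFC 8878). The sketch below uses only the Python standard library. describe_frame is a hypothetical helper written for this illustration, not part of any of the three libraries, and it only reads the frame header and the block headers without decompressing anything.

```python
import base64

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, stored little-endian


def describe_frame(frame: bytes) -> dict:
    """Hypothetical helper: read the header fields and block list of one zstd frame."""
    if frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")

    descriptor = frame[4]                     # Frame_Header_Descriptor
    fcs_flag = descriptor >> 6                # Frame_Content_Size_flag (bits 7-6)
    single_segment = bool(descriptor & 0x20)  # Single_Segment_flag (bit 5)
    has_checksum = bool(descriptor & 0x04)    # Content_Checksum_flag (bit 2)
    dict_id_size = (0, 1, 2, 4)[descriptor & 0x03]

    pos = 5
    if not single_segment:
        pos += 1                              # Window_Descriptor byte
    pos += dict_id_size

    # Size of the Frame_Content_Size field; flag 0 means "absent" unless
    # the frame is single-segment, in which case it is one byte.
    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
    content_size = None
    if fcs_size:
        content_size = int.from_bytes(frame[pos:pos + fcs_size], "little")
        if fcs_size == 2:
            content_size += 256               # 2-byte form is stored with an offset
        pos += fcs_size

    blocks = []
    while True:
        header = int.from_bytes(frame[pos:pos + 3], "little")
        pos += 3
        last = bool(header & 1)
        block_type = ("raw", "rle", "compressed", "reserved")[(header >> 1) & 3]
        size = header >> 3
        blocks.append((block_type, size, last))
        pos += 1 if block_type == "rle" else size  # an RLE block stores one byte
        if last:
            break

    return {
        "content_size": content_size,
        "single_segment": single_segment,
        "has_checksum": has_checksum,
        "blocks": blocks,
    }


streamed = describe_frame(base64.b64decode(
    "KLUv/QBgKAEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLgEAAA=="))
one_shot = describe_frame(base64.b64decode(
    "KLUv/SAlKQEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLg=="))
print(streamed)
print(one_shot)
```

For output 1 this reports no content size and two blocks: a 37-byte raw block followed by an empty block that marks the end of the frame. For outputs 2 and 3 it reports a content size of 37 and a single raw block. Both frames look well-formed; the first one just does not record the decompressed length, which matches the wording of the Python error.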

Code used (VB.NET):

Compressing:

1: (This Library)

Imports System.IO
Imports System.IO.Compression
Imports System.Text
Imports Zstandard.Net

Dim data As Byte() = UTF8Encoding.UTF8.GetBytes("This is a unbelievable long string...")
Using mso As MemoryStream = New MemoryStream()
    Using compressionStream As ZstandardStream = New ZstandardStream(mso, CompressionMode.Compress)
        compressionStream.CompressionLevel = 11
        compressionStream.Write(data, 0, data.Length)
        ' Close first so that buffered data and the end of the frame are written to mso
        compressionStream.Close()
        Return Convert.ToBase64String(mso.ToArray(), Base64FormattingOptions.None)
    End Using
End Using

2: (simple Version Library)

Imports ZstdNet

Dim data As Byte() = UTF8Encoding.UTF8.GetBytes("This is a unbelievable long string...")
Using comp = New ZstdNet.Compressor(New CompressionOptions(11))
    Return Convert.ToBase64String(comp.Wrap(data), Base64FormattingOptions.None)
End Using

3: (Python Full Library):

import zstandard as zstd
import base64

cctx = zstd.ZstdCompressor(level=11)
Text = "This is a unbelievable long string..."
compressedData = cctx.compress(Text.encode("utf-8"))
base64Data = base64.b64encode(compressedData).decode("utf-8")

Testing compatibility (in Python):

Decompressing of String compressed with Simple Library (2.):

dctx = zstd.ZstdDecompressor()

Test1 = "KLUv/SAlKQEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLg=="
bytedata = base64.b64decode(Test1)
TestString = dctx.decompress(bytedata).decode("utf-8")
print(TestString) 

#Will return: "This is a unbelievable long string..."

Decompressing of String compressed with this Library (1.):

dctx = zstd.ZstdDecompressor()

Test1 = "KLUv/QBgKAEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLgEAAA=="
bytedata = base64.b64decode(Test1)
TestString = dctx.decompress(bytedata).decode("utf-8")

#Will throw exception!: ZstdError('could not determine content size in frame header')

I hope I only made a mistake in using this library; any help would be appreciated. Otherwise (if the calculated lengths or the frame header are incorrect), a fix would be a breaking change, since there may already be programs depending on this (broken?) behavior. Files previously compressed with this library would no longer be readable after a fix.

Thank you all very much for your help. I hope that raising this issue also helps other people.

Best regards, Marcel

Grantelbob commented 5 years ago

Wow, I am very sorry. I was able to resolve my problem; everything is OK here. I just did not understand this well enough. I will close the issue.

As I just learned, data compressed via the streaming API does not include the content length in the frame header. So you simply have to use the streaming API in the other implementations too.

For the Python zstandard implementation this would be:

ZStandardDcmp = zstd.ZstdDecompressor()

compressedData = base64.b64decode(Base64EncodedData)
stream_reader = ZStandardDcmp.stream_reader(compressedData)
decompressedTXTData = stream_reader.read().decode('utf-8-sig')
stream_reader.close()
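
Besides stream_reader, the python-zstandard package offers two other ways to read a frame whose header has no content size: a decompression object (decompressobj), or the one-shot decompress call with an explicit max_output_size. The sketch below wraps both; decompress_without_content_size is a hypothetical helper name, and the 1024-byte limit in the usage note is an arbitrary value chosen for this short test string.

```python
import base64


def decompress_without_content_size(frame: bytes, max_output_size: int = 0) -> bytes:
    """Hypothetical helper: decompress a frame that does not record its content size."""
    # Third-party package (pip install zstandard); imported here so that
    # defining the helper does not itself require the package.
    import zstandard as zstd

    dctx = zstd.ZstdDecompressor()
    if max_output_size > 0:
        # One-shot API: works without a stored content size as long as an
        # upper bound for the output is supplied by the caller.
        return dctx.decompress(frame, max_output_size=max_output_size)
    # Streaming decompression object: needs no size information up front.
    return dctx.decompressobj().decompress(frame)


# Output of this library (1.) from the original report.
streamed = base64.b64decode(
    "KLUv/QBgKAEAVGhpcyBpcyBhIHVuYmVsaWV2YWJsZSBsb25nIHN0cmluZy4uLgEAAA==")
```

Usage would be decompress_without_content_size(streamed).decode("utf-8"), or decompress_without_content_size(streamed, max_output_size=1024) when a safe upper bound for the output is known. For very large payloads such as the JSON document described above, a chunked reader is probably still the better fit, since both variants here hold the whole result in memory.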

zxzharmlesszxz commented 1 year ago

This really is still happening right now: data compressed with zstd on the command line cannot be read by Python, and vice versa.

calclavia commented 8 months ago

Running into the same issue. Any follow up on this?