MiloszKrajewski / K4os.Compression.LZ4

LZ4/LZ4HC compression for .NET Standard 1.6/2.0 (formerly known as lz4net)
MIT License

Different Output After Copying Stream #60

Closed RLashofRegas closed 1 year ago

RLashofRegas commented 3 years ago

Description

I am getting strange results with a specific lz4 file. I have a tar.lz4 where the tar contains a bunch of json files. When I try to deserialize the json files it is failing with some garbage data. I found that with the particular lz4 file I am using, if I pass the LZ4DecoderStream directly into a TarInputStream (SharpZipLib) I see the garbage data (actually repeat of a section of the end of one of the files because ReadBlock is returning 0 so TarInputStream is continuing with the same buffer that was filled on the previous call). However, if I first copy the LZ4DecoderStream into a MemoryStream using CopyTo it reads correctly. This same lz4 file can successfully be decoded by lz4.exe and 7-zip-zstd (although I guess it's not a 1-1 test because LZ4DecoderStream can also do it successfully if I am going to a file on disk vs passing it into another stream). After I observed this behavior I compared the stream after being copied to a MemoryStream and the original and they differ.

To reproduce

Apologies this is a little vague but so far I have only been able to reproduce this issue with one file that contains proprietary data so I cannot share. The same file works if I decompress it with 7-zip or lz4.exe and then re-compress it.

Steps to reproduce the behavior:

string lz4File_2 = Path.Combine(outputDir, "TestFiles.tar.lz4");
File.Copy(lz4File, lz4File_2);
using (FileStream inputFileStream = File.OpenRead(lz4File))
using (LZ4DecoderStream decompressionStream = LZ4Stream.Decode(inputFileStream))
using (FileStream inputFileStream_2 = File.OpenRead(lz4File_2))
using (LZ4DecoderStream decompressionStream_2 = LZ4Stream.Decode(inputFileStream_2))
using (var intermediateStream = new MemoryStream())
{
    decompressionStream_2.CopyTo(intermediateStream);
    intermediateStream.Position = 0;

    int originalByte, intermediateByte;
    do
    {
        originalByte = decompressionStream.ReadByte();
        intermediateByte = intermediateStream.ReadByte();

        if (originalByte != intermediateByte)
        {
            throw new Exception("Bytes are not equal");
        }
    }
    while (originalByte != -1 && intermediateByte != -1);
}

Expected behavior

Should not throw an exception, as both streams should be identical.

Actual behavior

Throws an exception. Here are some details from the break point of the exception:

Environment

MiloszKrajewski commented 3 years ago

I'll take a look at this next week. Can you check reading blocks instead of single bytes? ReadByte is the least tested API.

RLashofRegas commented 3 years ago

Interesting. When I simply read the streams as blocks using Stream.Read(), no exception is thrown (the streams are equivalent). This is the code:

private static byte[] ReadStream(Stream stream, int length, int blockSize = 1024)
{
    byte[] bytes = new byte[length + blockSize];
    int numBytesToRead = length;
    int numBytesRead = 0;
    do
    {
        int n = stream.Read(bytes, numBytesRead, blockSize);
        numBytesRead += n;
        numBytesToRead -= n;
    }
    while (numBytesToRead > 0);

    return bytes;
}

private static void ReadBlocks(Stream decompressionStream, Stream intermediateStream)
{
    byte[] decompressionBytes = ReadStream(decompressionStream, (int)intermediateStream.Length);
    byte[] intermediateBytes = ReadStream(intermediateStream, (int)intermediateStream.Length);

    for (int i = 0; i < decompressionBytes.Length; i++)
    {
        if (decompressionBytes[i] != intermediateBytes[i])
        {
            throw new Exception("Bytes not equal.");
        }
    }
}

However, when I read them using the TarInputStream from the SharpZipLib library they are not equivalent. Namely, tarStream.GetNextEntry() throws "Header checksum invalid" here. This is the original error that led me down the rabbit hole of comparing the streams, and interestingly TarInputStream is calling Read() not ReadByte() but it's still causing problems. SharpZipLib code for that is here. If you note the comment there about "We have found EOF, and the record is not full!" that is the problem that I referenced in the original post that is causing the garbage data at the end of the stream because SharpZipLib is just returning the same bytes that were read on the previous call to ReadBlock. My code for reading the tar archives is as follows (again, this fails on tarStream.GetNextEntry() which is after intermediateTarStream.GetNextEntry() so the intermediate stream did not throw the same header checksum invalid error):

private static void ReadTar(LZ4DecoderStream decompressionStream, MemoryStream intermediateStream)
{
    using (var tarStream = new TarInputStream(decompressionStream, Encoding.UTF8))
    using (var intermediateTarStream = new TarInputStream(intermediateStream, Encoding.UTF8))
    {
        TarEntry tarEntry, intermediateTarEntry = null;
        while (true)
        {
            intermediateTarEntry = intermediateTarStream.GetNextEntry();
            tarEntry = tarStream.GetNextEntry();
            if (tarEntry == null || intermediateTarEntry == null)
            {
                if (tarEntry == null && intermediateTarEntry == null)
                {
                    break;
                }
                else 
                {
                    throw new Exception("One stream ended, the other still has data.");
                }
            }

            if (tarEntry.IsDirectory || intermediateTarEntry.IsDirectory)
            {
                if (tarEntry.IsDirectory && intermediateTarEntry.IsDirectory)
                {
                    continue;
                }
                else
                {
                    throw new Exception("One stream found a directory and the other didn't");
                }
            }

            using (var originalEntryContents = new MemoryStream())
            using (var intermediateEntryContents = new MemoryStream())
            {
                tarStream.CopyEntryContents(originalEntryContents);
                intermediateTarStream.CopyEntryContents(intermediateEntryContents);

                ReadByteByByte(originalEntryContents, intermediateEntryContents);
            }

        }
    }
}
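
One thing that may be worth ruling out: Stream.Read() is allowed to return fewer bytes than requested even when the stream has not ended, and a consumer that expects a full record per call can misread a short read as EOF. Nothing in this thread confirms that this is what happens here, so the following is only a sketch under that assumption. FullReadStream is a hypothetical wrapper, not part of K4os.Compression.LZ4 or SharpZipLib. It keeps calling the inner stream until the requested count is filled or the inner stream returns 0. It would not help if the inner stream really returns 0 before the end of the data.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper: turns short reads into full reads.
public sealed class FullReadStream : Stream
{
    private readonly Stream _inner;

    public FullReadStream(Stream inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();

    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            // Ask only for the bytes that are still missing.
            int n = _inner.Read(buffer, offset + total, count - total);
            if (n == 0)
            {
                break; // the inner stream reports end of data
            }
            total += n;
        }
        return total; // less than count only at end of stream
    }

    public override void Flush() { }

    public override long Seek(long offset, SeekOrigin origin) =>
        throw new NotSupportedException();

    public override void SetLength(long value) =>
        throw new NotSupportedException();

    public override void Write(byte[] buffer, int offset, int count) =>
        throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _inner.Dispose(); // the wrapper owns the inner stream
        }
        base.Dispose(disposing);
    }
}
```

To try it with the code above, the decoder would be wrapped before it is handed over: new TarInputStream(new FullReadStream(decompressionStream), Encoding.UTF8). If the checksum error disappears, short reads are the likely trigger; if it persists, the problem is somewhere else.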

MiloszKrajewski commented 3 years ago

I did some testing with Tar streams and it works fine.

I mean, I understand that it might be a bug, but it is not a general problem; there must be something specific to the data. My guess is that it might be something about how SharpZipLib calls LZ4Stream (for example, it tries to read -1 bytes), because I understand that if you decompress first and then use TarInputStream it is all fine? It fails ONLY if TarInputStream reads directly from LZ4Stream?

Anyway, I think that without the actual files it becomes a wild goose chase. Maybe you can generate fake data that has the same problem?

MiloszKrajewski commented 3 years ago

@RLashofRegas any luck reproducing it?

RLashofRegas commented 3 years ago

Sorry been super busy with other things but no. I will try next chance I get but I had tried previously and was not able to reproduce with other files.