"A local file header is corrupt" error after upgrading to 3.0/3.1

Xantrul commented 4 years ago

We have an application that reads and process information from zip files. After we upgraded target framework to 3.0 (or 3.1) from 2.1 without upgrading the code, processing of larger zip files started to fail with: System.IO.InvalidDataException: A local file header is corrupt.

This issue blocks us from upgrading to dotnet 3.1 and we need to upgrade asap to resolve the out of memory issues 2.1 has while running in container. I spent some time narrowing down the issue and here's what I found...

Sample code to repro the issue is:

using System;
using System.IO;
using System.IO.Compression;

namespace ZipProblemRepro
{
    class Program
    {
        private const string ZipPath = "F:/Test/bad_header/java_1_file.zip";

        public static void Main(string[] args)
        {
            Console.WriteLine($"Starting to read '{ZipPath}'...");

            using (var zip = ZipFile.Open(ZipPath, ZipArchiveMode.Read))
            {
                var linesProcessed = 0L;

                foreach (var fileEntry in zip.Entries)
                {
                    using (var stream = fileEntry.Open())
                    {
                        linesProcessed += ReadLines(stream);
                    }
                }

                Console.WriteLine($"Processed {linesProcessed} lines");
            }

            Console.WriteLine("Completed successfully!");
        }

        private static int ReadLines(Stream stream)
        {
            var linesRead = 0;
            using (var reader = new StreamReader(stream))
            {
                string line;
                while (!reader.EndOfStream)
                {
                    try
                    {
                        line = reader.ReadLine();
                    } catch (OutOfMemoryException ex)
                    {
                        line = null;
                    }

                    ++linesRead;
                }
            }

            return linesRead;
        }
    }
}

While running this code on the same zip file, it succeeds when target framework is netcoreapp2.1, but fails when targeting netcoreapp3.0 or netcoreapp3.1. All tests were performed on the same machine with .NET Core SDK 3.1.100

The hardest part was to produce a brand new zip to repro the problem. Turned out that 2 things are required to get a repro zip:

Zip should contain a large file in it
Zip should be created with Java zip library. This sounds weird, but this is the only way I was able to repro this. Tried different settings for 7z, Windows Explorer compression and command line zip, but no luck with any of those.

I am including the code I used to generate repro zip below, but I can also supply this zip if you tell me where to upload it (315MB).

C# code I used to generate repro file:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

namespace ZipGenerator
{
    class Program
    {
        private const string TempDirPath = "Temp";

        private const int GuildsPerLine = 10;
        private const int LinesPerFile = 12000000;
        private const int FilesPerDir = 1;
        private const int DirsTotal = 1;

        public static void Main(string[] args)
        {
            Console.WriteLine("Generating...");

            Directory.CreateDirectory(TempDirPath);
            for (var dirCount = 0; dirCount < DirsTotal; dirCount++)
            {
                if (dirCount % 1 == 0)
                {
                    Console.WriteLine($"Generated {dirCount} out of {DirsTotal} directories so far...");
                }

                var dirName = Guid.NewGuid().ToString("N");
                var dirPath = Path.Combine(TempDirPath, dirName);
                Directory.CreateDirectory(dirPath);

                for (var fileCount = 0; fileCount < FilesPerDir; fileCount++)
                {
                    var fileName = Guid.NewGuid().ToString("N");
                    var filePath = Path.Combine(dirPath, fileName);
                    GenerateEasierToCompressFileWithGuids(filePath);
                }
            }

            Console.WriteLine("Files created!");
        }

        private static void GenerateEasierToCompressFileWithGuids(string filePath)
        {
            using (var file = new StreamWriter(filePath))
            {
                for (var lineCount = 0; lineCount < LinesPerFile; lineCount++)
                {
                    var sb = new StringBuilder();
                    var fixedGuidPerLine = Guid.NewGuid().ToString("N");

                    for (var guidInLineCount = 0; guidInLineCount < GuildsPerLine; guidInLineCount++)
                    {
                        sb.Append(fixedGuidPerLine);
                    }

                    file.WriteLine(sb.ToString());
                }
            }
        }
    }
}

Java code to zip the file:

package com.test.zipper;

import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class Main {
    private static final String dirToZip = "D:\\VS\\Sandbox\\ZipProblemRepro\\Temp";

    public static void main(String[] args) throws IOException {
        String outputFile = "F:\\Test\\bad_header\\javaOutput.zip";

        System.out.println("Zipping everything in '" + dirToZip + "'");

        FileOutputStream fos = new FileOutputStream(outputFile);
        BufferedOutputStream bos =  new BufferedOutputStream(fos);
        ZipOutputStream zipOutputStream = new ZipOutputStream(bos);
        zipOutputStream.setLevel(Deflater.BEST_SPEED);

        Files.walk(Paths.get(dirToZip))
                .filter(Files::isRegularFile)
                .forEach((path) -> ZipFile(zipOutputStream, path));

        System.out.println("Done zipping. Closing everything...");

        zipOutputStream.close();
        bos.close();
        fos.close();

        System.out.println("Done!");
    }

    private static void ZipFile(ZipOutputStream zipOutputStream, Path path) {
        try {
            String fileToZip = path.toAbsolutePath().toString();
            String pathInZip = fileToZip.substring(dirToZip.length() + 1);
            FileInputStream fis = new FileInputStream(fileToZip);
            ZipEntry zipEntry = new ZipEntry(pathInZip);
            zipOutputStream.putNextEntry(zipEntry);
            byte[] bytes = new byte[1024];
            int length;
            while((length = fis.read(bytes)) >= 0) {
                zipOutputStream.write(bytes, 0, length);
            }
            zipOutputStream.closeEntry();
        } catch (Exception ex) {
            throw new RuntimeException("Failed while zipping. Exception was: " + ex.getMessage());
        }
    }
}

danmoseley commented 4 years ago

Thanks @Xantrul . Given the time of year, we likely can't look at this immediately. Would you consider trying 3.0 previews to get a better idea when the problem change went in? They can be installed side by side here https://dotnet.microsoft.com/download/dotnet-core/3.0

danmoseley commented 4 years ago

@carlossanlop

Xantrul commented 4 years ago

@danmosemsft Totally understand re time of the year - I am on vacation myself 🎄 I'll see if someone from my team is in the office and can compare different previews, or I can do it myself (if still relevant) in January.

Xantrul commented 4 years ago

@danmosemsft Tried few different versions of SDK and it looks like the problem was introduced in preview7 (preview6 works fine).

Xantrul commented 4 years ago

Any news on this one?

MikeCodesDotNET commented 4 years ago

I've also hit this bug in my WPF app which handles large Zip files. I've tested targeting .NET framework and everything works as expected with the issue only happening with .NET Core.

I'll be in Redmond next week (3rd of Feb) if you're around and need a few example files that cause problems. It'd also be great to get an understanding of when a fix might be pushed.

danmoseley commented 4 years ago

@eerhardt any cycles to look at this one? Possible servicing candidate.

eerhardt commented 4 years ago

Yep. I’ll take a look tomorrow.

eerhardt commented 4 years ago

I was able to repro the original issue posted and debugged the problem.

From what I can tell, the error is here:

https://github.com/dotnet/runtime/blob/6a0dafb7f8fa5ac3531be56dc803c3ae92e49201/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipBlocks.cs#L519-L520

In the originally posted issue, the uncompressed file is larger than int.MaxValue, but smaller than uint.MaxValue. Thus, java is writing the .zip file as 32-bit and we are reading the values as signed 32-bit integers. In this case, the value is larger than int.MaxValue, so when read and converted to a long, the value becomes negative. Later in the method, we compare uncompressedSize to the entry.Length, and that fails.

I've pushed a branch with the fix applied: https://github.com/eerhardt/runtime/tree/Fix1094. This allows the original file to be read successfully. I'll add a test to that branch and then make a PR for the fix. We also should service 3.1 with the fix, as other people can hit this problem as well.

/cc @buyaa-n

@MikeCodesDotNET - I was also able to reproduce the same exception message with the file you sent me Clay Paky@Golden Scan HPE@justatest (1).gdtf. However, even with the above fix, the exception still occurs. The is because the values read from the stream aren't matching:

I'm unsure of what your error is - it may actually be a corrupt .zip file. Can you open a new issue for your scenario?

For the user.zip file you sent me, I was unable to repro an exception.

danmoseley commented 4 years ago

Nice. The same error exists for 64 bit just above - although it's safe to say that it will never cause a problem..

                    compressedSize = reader.ReadInt64();
                    uncompressedSize = reader.ReadInt64();

   4.4.1 General notes on fields

      4.4.1.1  All fields unless otherwise noted are unsigned and stored
      in Intel low-byte:high-byte, low-word:high-word order.

danmoseley commented 4 years ago

Actually, maybe not -- it seems we use signed longs even in the public API - and there are checks elsewhere (SR.FieldTooBigUncompressedSize) that it doesn't overflow that.

MikeCodesDotNET commented 4 years ago

@eerhardt I've looked into the files a little more and it looks like you're correct that the Zip is invalid.

I've pinged the company responsible for creating the files to see if they've any input on this. It looks like one of the images is the cause of the problem, with most zip tools will gracefully handle the errors (as .NET framework did before).

Testing wheels\00590129.png
Invalid local header for wheels\00590129.png encountered.
Current Location part 1 offset 0
Local directory entry PK0304 (4+26): #1

    operat. system version needed to extract (00):  MS-DOS, OS/2, NT FAT
    unzip software version needed to extract (20):  2.0
    general purpose bit flag (0x0008) (bit 15..0):  0000.0000 0000.1000
      file security status  (bit 0):                not encrypted
      extended local header (bit 3):                yes
    compression method (00):                        none (stored)
    file last modified on (0x00005037 0x0000aed4):  2020-01-23 21:54:40
    32-bit CRC value:                               0x3b380dc5
    compressed size:                                6327 bytes
    uncompressed size:                              6327 bytes
  note: "real" crc and sizes are in the extended local header
    length of filename:                             19 characters
    length of extra field:                          0 bytes
    filename: wheels\00590129.png

Because this seems to be expected behaviour with .NET Core, I wont log any further issues regarding this. Though, It'd be nice to have an option to extract without validation.

Cheers, Mike J

eerhardt commented 4 years ago

Though, It'd be nice to have an option to extract without validation.

I was thinking something similar actually. Your zip file does load successfully in .NET Core 2.1, but with the extra validation we added in 3.1, it now fails.

@buyaa-n @ericstj @carlossanlop @ahsonkhan @ViktorHofer - thoughts? It would probably require adding a new public API to disable the validation - unless we wanted to "quirks"-mode it, but that wouldn't be my vote. I can propose something in a new issue, if people think it would be valuable.

MikeCodesDotNET commented 4 years ago

It would 100% be valuable given the breaking change between 2.1 and 3.1. Though I'll leave it to experts to decide if its right to include and how :)

As an FYI, given that my issue only appeared after migrating to 3.1 from Framework (I'm using WPF), should the API not keep backwards compatibility so that desktop applications which rely on this functionality continue to work?

d0tn3tc0d3r commented 4 years ago

I ran into the same exception while opening some files contained in in large zip files from an external source. It wouldn't surprise me if they were generated by Java based tools. The files can be read without problems with the same code compiled for the .NET Framework. Converting the project to .NET Core 3.1 is now blocked by this issue.

The issue in the zip files that I have is that some files are flagged as Zip64 (version needed to extract = 45 in the local file header), but the data descriptor contains 32 bit sizes (and are less than 4 GiB in size). So the check whether the size in the data descriptor matches the one in the central directory record fails. I implemented an ugly workaround for it, but then I ran into the issue that the check still fails for files with sizes >= 2 GiB (and less than 4 GiB) due to the sized incorrectly being interpreted as signed instead of unsigned. So the fix @eerhardt pushed would be very welcome, as it should make it possible to read the files again with the workaround.

Anyway, 7-Zip, .NET Framework and other tools I have tried have no issues with the incorrect data descriptor. The information needed to read the data is in the central directory (and present in the ZipArchiveEntry), so not sure why the extra check has been implemented in the first place. It causes issues with files that might technically not conform to the specifications, but other software reads just fine. And I haven't found any software that can actually verify the correctness of a zip file and show the issue (to make it easier to report the problem to the organization that generated those files). So it would be nice to know which tool @MikeCodesDotNET used to check the zip file. For people who don't dive into the structure of zip files, it will seem to be a problem in .NET Core (3.0 and 3.1).

If the validation is really needed, it could be an option to change the validation so that it can handle the case where a 32 bit data descriptor is present for files smaller than 4 GiB that are incorrectly tagged as Zip64. Not a nice solution either, but if there are a lot of zip files out there with similar issues, it could prevent a lot of frustration.

ericstj commented 4 years ago

We should fix the check to read as Uint32 in servicing for 3.1. You don't need API, typically we do this sort of thing with AppContext switches, which can be controlled via runtimeconfig IIRC. The benefit there is that you can do it in servicing without API.

I'd prefer that we don't dirty the API here since this was clearly a bug in the validation. If we find cases where the validation is not buggy and we still flag legit zips (eg: a widespread bad behaving zip library) then we should consider adding an API to suppress the error.

MikeCodesDotNET commented 4 years ago

@d0tn3tc0d3r, I used the 'Advanced Diagnostics' in WinZip to validate the zip files. If you use the standard validation, it'll actually show the Zip files as valid!

@ericstj As far as I can see, most zip tools do not validate, which is why WinZip and 7Zip will successfully unzip these invalide zips. Given that .NET Framework also gracefully handled the invalid zip fles, should we not try and keep backwards compatability? My preference would be add an optional boolean arguement to ExtractToDirectory method which enables or disables validation. Ideally with the default to disable validation to keep compability with .NET framework and 2.1.

buyaa-n commented 4 years ago

@d0tn3tc0d3r

The issue in the zip files that I have is that some files are flagged as Zip64 (version needed to extract = 45 in the local file header), but the data descriptor contains 32 bit sizes (and are less than 4 GiB in size). So the check whether the size in the data descriptor matches the one in the central directory record fails.

I think that issue should be fixed too, @eerhardt could you add fix for that with your Uint32 fix? or i can work on the fix CC @ericstj

so not sure why the extra check has been implemented in the first place.

We had issue ZipArchive extracting a tampered zip file which Uncompressed size in central directory is much bigger that real Uncompressed size https://github.com/dotnet/runtime/issues/27741

d0tn3tc0d3r commented 4 years ago

@buyaa-n Thanks for the explanation of the reasons for the validation.

In any case, I checked the zip file with WinZip Detailed Diagnostics, and it did not report any errors. These are the most important parts of the test results for a file that can't be read using .NET Core 3.0 or 3.1:

Central directory entry PK0102 (4+42): #20
======================================
    relative offset of local header:                4294967295 (0xffffffff) bytes
    version made by operating system (00):          MS-DOS, OS/2, NT FAT
    version made by zip software (45):              4.5
    operat. system version needed to extract (00):  MS-DOS, OS/2, NT FAT
    unzip software version needed to extract (45):  4.5
    general purpose bit flag (0x0808) (bit 15..0):  0000.1000 0000.1000
      file security status  (bit 0):                not encrypted
      extended local header (bit 3):                yes
      UTF-8 names          (bit 11):                yes
    compression method (08):                        deflated
      compression sub-type (deflation):             normal
    32-bit CRC value:                               0x0e4b27fc
    compressed size:                                12592413 bytes
    uncompressed size:                              285871364 bytes
    extra field 0x0001 (ZIP64 Tag), 4 header and 8 data bytes:
    ZIP64 Tag Value(s):
      Value #1:                                     16079571729

Current Location part 1 offset 16079571729
Local directory entry PK0304 (4+26): #20
------------------------------------
    operat. system version needed to extract (00):  MS-DOS, OS/2, NT FAT
    unzip software version needed to extract (20):  2.0
    general purpose bit flag (0x0808) (bit 15..0):  0000.1000 0000.1000
      file security status  (bit 0):                not encrypted
      extended local header (bit 3):                yes
      UTF-8 names          (bit 11):                yes
    compression method (08):                        deflated
      compression sub-type (deflation):             normal
    32-bit CRC value:                               0x00000000
    compressed size:                                0 bytes
    uncompressed size:                              0 bytes
  note: "real" crc and sizes are in the extended local header

Current Location part 0 offset 16092164183
Extended local dir entry PK0708 (4+12): #20
---------------------------------------
    32-bit CRC value:                               0x0e4b27fc
    compressed size:                                12592413 bytes
    uncompressed size:                              285871364 bytes

So, only the offset is in Zip64 format, that's why version required is 45 in the central directory. The uncompressed and compressed sizes are not specified in 64 bit format, as that is not needed for this file. That might be why it's not stored as 64 bits in the data descriptor. WinZip Detailed Diagnostics does not report that as a problem though.

If needed, I can send someone a link to a file that can be downloaded. It might also be reproducible by creating a zip file that's bigger than 4 GiB and has a file that's smaller than 4 GiB and has a data descriptor, with the local file header located past the 4 GiB offset in the file.

eerhardt commented 4 years ago

If needed, I can send someone a link to a file that can be downloaded

Yes, can you please send it to me? My email is public in my GH profile: https://github.com/eerhardt

MikeCodesDotNET commented 4 years ago

@eerhardt Do you think we'll get an API / the ability to gracefully handle invalid headers or should I focus my efforts on finding a workaround?

Cheers, Mike J

buyaa-n commented 4 years ago

@d0tn3tc0d3r the zip you mentioned above and the zip @MikeCodesDotNET posted before are invalid zips. But as other tools creating them and accepting them we might update the API to allow the zip files having that minor issue, it would be helpful if you can share your file @MikeCodesDotNET for testing

buyaa-n commented 4 years ago

@MikeCodesDotNET never mind, I got the file from @eerhardt

eerhardt commented 4 years ago

@eerhardt Do you think we'll get an API / the ability to gracefully handle invalid headers or should I focus my efforts on finding a workaround?

The current plan is to introduce an AppContext switch into both 3.1 and 5.0 that will turn the new validation off. So in your application, if you set this setting, it will allow ZipArchive to load even if the headers are invalid, basically falling back to .NET Core 2.1 behavior.

Along with that switch, we are also fixing the bug mentioned above - changing ReadInt32 to ReadUInt32.

MikeCodesDotNET commented 4 years ago

@eerhardt Wonderful! This is perfect.

I'll keep badgering the original zip creators to fix it their end as well but I appreciate the ability to handle it on the .NET side as well.

Thank you!!

d0tn3tc0d3r commented 4 years ago

@buyaa-n based on which facts are you saying that the zip I'm reading is invalid? WinZip doesn't report it as invalid, and I haven't found any tool that does. Nobody else has reported having problems with these files as far as I know, and I have old files going back nearly 4 years that have the same issue. Right now, I don't think I can report the files as being invalid.

The specifications regarding the data descriptor are a bit lacking in my opinion. It does not mention at all how a reader can know whether the data descriptor will have 64 bit or 32 bit sizes when reading the zip in streaming mode. And when writing in streaming mode, the writer doesn't need to know what the sizes will be at the time the local file header is written, so it can't be based on information in that header either. So I think that using the "version needed to extract" from the central directory to decide whether to read the sizes in the data descriptor as 32 bit or 64 bit values (and throw an exception if it's not the case) is not a good implementation.

Anyway, I already sent @eerhardt a link, but I sent it to @MikeCodesDotNET as well.

buyaa-n commented 4 years ago

@d0tn3tc0d3r sorry i told that based on your previous comment:

The issue in the zip files that I have is that some files are flagged as Zip64 (version needed to extract = 45 in the local file header), but the data descriptor contains 32 bit sizes (and are less than 4 GiB in size). So the check whether the size in the data descriptor matches the one in the central directory record fails.

Which sounded like invalid zip to me, but with debugging your file we see the issue same as you described above which is a bug, we are working on the fix, thanks for the file

eerhardt commented 4 years ago

Looking at the implementation that used to be in the dotnet/wpf repo, they used a more forgiving algorithm for reading the data descriptor. Maybe using the same algorithm would work for us. I believe it would fix @d0tn3tc0d3r's scenario where some data descriptors have 4 byte lengths and some have 8 byte lengths.

Looking at the spec https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT, the rule seems ambiguous to me:

 4.3.9.2 When compressing files, compressed and uncompressed sizes 
 SHOULD be stored in ZIP64 format (as 8 byte values) when a 
 file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
 used regardless of the size of a file.  When extracting, if 
 the zip64 extended information extra field is present for 
 the file the compressed and uncompressed sizes will be 8
 byte values.

The When extracting, if the zip64 extended information extra field is present for the file the compressed and uncompressed sizes will be 8 byte values. section is the key here. I could interpret it at least 2 ways:

If there are any Zip64 extended information extra fields, then the sizes will be 8 byte values
(the way @d0tn3tc0d3r's zip appears to be created) – If there are Zip64 extended information extra fields for either compressed size or uncompressed size, then the size will be 8 byte values.

An example of how these are different is that the relative offset of local header field may be present in the ZIP64 extended information extra fields, but not the compressed or uncompressed sizes. In @d0tn3tc0d3r's zip, when this occurs the data descriptor lengths are still 4 bytes.

I assume that is why the dotnet/wpf implementation tries reading it each way and comparing the values with the central directory entry to figure out which way it was written.

d0tn3tc0d3r commented 4 years ago

While experimenting a bit with the .NET Core ZipArchive, I found a way to create a zip file containing a file with a data descriptor that has 32 bit sizes, but is at an offset of more than 4 GiB. .NET Core 3.1 (without fix for the signed int issue) does not read any of the entries in the zip generated by this code.

using System;
using System.IO;
using System.IO.Compression;

namespace CreateZip
{
    public class NoSeekFileStream : FileStream
    {
        public NoSeekFileStream(string path, FileMode mode) : base(path, mode) { }
        public override bool CanSeek => false;
    }
    class Program
    {
        static void Main(string[] args)
        {
            using (var zipStream = new NoSeekFileStream("DotNETCoreTest.zip", FileMode.Create))
            using (var zipArchive = new ZipArchive(zipStream, ZipArchiveMode.Create))
            {
                var buffer = new byte[0x10000];
                using (var stream = zipArchive.CreateEntry("test1.bin", CompressionLevel.NoCompression).Open())
                    for (int i = 0; i < 0x8000; i++) stream.Write(buffer);
                using (var stream = zipArchive.CreateEntry("test2.bin", CompressionLevel.NoCompression).Open())
                    for (int i = 0; i < 0x8000; i++) stream.Write(buffer);
                using (var stream = zipArchive.CreateEntry("test3.bin", CompressionLevel.NoCompression).Open())
                    stream.Write(buffer, 0, 0x2A);
            }
        }
    }
}

So I guess the people that implemented the support for writing zip files to a stream can explain their interpretation of the zip specifications. SharpZipLib can also generate zip files that have a similar structure, so there are multiple implementations that do the same thing.

buyaa-n commented 4 years ago

The current plan is to introduce an AppContext switch into both 3.1 and 5.0 that will turn the new validation off.

We have removed local header validation vs central directory, therefore no need AppContext switch. Decompressed stream will still be restricted by uncompressed size.

Fix merged to 5.0, reopening for porting to 3. 1

jonathanantoine commented 4 years ago

Hello,

I just wanted to share an ugly workaround for those who needs to stay on 3.1 and can't wait for the release of this fix. Instead of calling :

 firstEntry.Open();

Call this private OpenInReadMode method using reflection but disabling the header validation :

 firstEntry.GetType()
   .GetMethod("OpenInReadMode", BindingFlags.NonPublic | BindingFlags.Instance)
   .Invoke(firstEntry, new object[] {false}) as Stream;

dotnet / runtime

"A local file header is corrupt" error after upgrading to 3.0/3.1 #1094