adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.
MIT License

GZipped Excel xlsx files choke #291

Open mklaber opened 7 years ago

mklaber commented 7 years ago

Excel's xlsx format is really just a zip archive of XML files. If such a file is gzipped, ReaderFactory seems to un-gzip the content and then try to open the result as a nested archive. As a result, IEntry.Key contains the first bytes of the file's content rather than the name of the file.

To reproduce:

  1. Create an Excel Workbook file (*.xlsx)
  2. Gzip it: gzip -k Book1.xlsx
  3. Read it with ReaderFactory:
    using (var stream = File.OpenRead(@"C:\tmp\Book1.xlsx.gz"))
    using (var reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())
        {
            // Dump() is LINQPad's output helper; use Console.WriteLine elsewhere.
            reader.Entry.Key.Dump();
        }
    }

The Entry.Key value that is dumped is PK   ! A7��n   [Content_Types].xml �(� 

I'd expect the Key to be the file name Book1.xlsx (or at least not the first lines of the file).
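For reference, gzip normally records the original file name in the header's optional FNAME field (RFC 1952), which is where a key such as Book1.xlsx could come from. Below is a minimal sketch of reading that field using only the base class library. GZipHeader.ReadOriginalName is a hypothetical helper written for illustration and is not part of SharpCompress.

```csharp
using System;
using System.IO;
using System.Text;

public static class GZipHeader
{
    // Hypothetical helper: returns the FNAME field of a gzip header,
    // or null if the stream is not gzip or carries no original name.
    public static string ReadOriginalName(Stream stream)
    {
        // Fixed 10-byte header: ID1 ID2 CM FLG MTIME(4) XFL OS.
        var header = new byte[10];
        if (ReadFully(stream, header) != 10 || header[0] != 0x1F || header[1] != 0x8B)
            return null;

        int flags = header[3];

        // FEXTRA (0x04): a 2-byte little-endian length followed by that many bytes.
        if ((flags & 0x04) != 0)
        {
            int lo = stream.ReadByte(), hi = stream.ReadByte();
            if (lo < 0 || hi < 0) return null;
            int extraLength = lo | (hi << 8);
            for (int i = 0; i < extraLength; i++)
                if (stream.ReadByte() < 0) return null;
        }

        // FNAME (0x08): zero-terminated ISO-8859-1 string.
        if ((flags & 0x08) == 0) return null;

        var name = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) > 0)
            name.WriteByte((byte)b);
        if (b < 0) return null; // stream ended before the terminator

        return Encoding.GetEncoding("ISO-8859-1").GetString(name.ToArray());
    }

    // Reads until the buffer is full or the stream ends; returns bytes read.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

Calling GZipHeader.ReadOriginalName(File.OpenRead(path)) on a file produced by gzip -k Book1.xlsx would be expected to yield the original name, as long as gzip was not run with the option that omits it.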

Open to other suggestions on how it should work but as it stands I'd have to special case for *.xlsx.gz files which seems to defeat the purpose of a general ReaderFactory that can handle any of the supported formats you throw at it.

Attachments: Book1.xlsx, Book1.xlsx.gz

Update: it looks like the underlying issue is that ReaderFactory's call to TarArchive.IsTarFile returns true for *.xlsx files: https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Readers/ReaderFactory.cs#L48

Kim-SSi commented 5 years ago

@adamhathcock I have been unable to figure out a good way to fix the TarArchive.IsTarFile detection. As part of the proposed fix I moved the IsTarFile check to the end of Open. Inside a compressed stream (gz, bz2, etc.), TarHeader will sometimes accept the content as a tar even when it is not. What are your thoughts on adding an option to ReaderOptions, such as TryOpenArchiveInStream, and then making the Open call recursive on compressed streams? If this is an acceptable solution, I am quite happy to create a PR.

https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Readers/ReaderFactory.cs#L29-L105

public static IReader Open(Stream stream, ReaderOptions options = null)
{
    stream.CheckNotNull("stream");
    options = options ?? new ReaderOptions()
    {
        LeaveStreamOpen = false
    };
    RewindableStream rewindableStream = new RewindableStream(stream);
    rewindableStream.StartRecording();
    if (ZipArchive.IsZipFile(rewindableStream, options.Password))
    {
        rewindableStream.Rewind(true);
        return ZipReader.Open(rewindableStream, options);
    }

    rewindableStream.Rewind(false);
    if (GZipArchive.IsGZipFile(rewindableStream))
    {
        rewindableStream.Rewind(false);
        GZipStream decompressedStream = new GZipStream(rewindableStream, CompressionMode.Decompress);
        // Proposed option: recurse to detect an archive nested inside the gzip stream.
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
        // No nested archive requested or recognized: read as a plain gzip file.
        rewindableStream.Rewind(true);
        return GZipReader.Open(rewindableStream, options);
    }

    rewindableStream.Rewind(false);
    if (BZip2Stream.IsBZip2(rewindableStream))
    {
        rewindableStream.Rewind(false);
        BZip2Stream decompressedStream = new BZip2Stream(new NonDisposingStream(rewindableStream), CompressionMode.Decompress, false);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }

    rewindableStream.Rewind(false);
    if (LZipStream.IsLZipFile(rewindableStream))
    {
        rewindableStream.Rewind(false);
        LZipStream decompressedStream = new LZipStream(new NonDisposingStream(rewindableStream), CompressionMode.Decompress);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }

    rewindableStream.Rewind(false);
    if (RarArchive.IsRarFile(rewindableStream, options))
    {
        rewindableStream.Rewind(true);
        return RarReader.Open(rewindableStream, options);
    }

    rewindableStream.Rewind(false);
    if (XZStream.IsXZStream(rewindableStream))
    {
        rewindableStream.Rewind(true);
        XZStream decompressedStream = new XZStream(rewindableStream);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }

    // Tar is checked last because IsTarFile can accept content that is not a tar.
    rewindableStream.Rewind(false);
    if (TarArchive.IsTarFile(rewindableStream))
    {
        rewindableStream.Rewind(true);
        return TarReader.Open(rewindableStream, options);
    }
    throw new InvalidOperationException("Cannot determine compressed stream type.  Supported Reader Formats: Zip, GZip, BZip2, Tar, Rar, LZip, XZ");
}
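
To show the recursive idea in isolation, here is a minimal sketch that uses only System.IO.Compression rather than SharpCompress. NestedFormatSniffer and Describe are hypothetical names; the sketch recognizes only gzip and zip by their magic bytes and buffers the whole stream in memory, so it illustrates the control flow and is not a proposed implementation.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class NestedFormatSniffer
{
    // Hypothetical helper: describes the nesting of formats found in a stream,
    // recursing into gzip payloads in the way the proposed option would.
    public static string Describe(Stream input, int depth = 0)
    {
        // Buffer everything so the bytes can be inspected and re-read.
        // Acceptable for a sketch; a real reader would use a rewindable stream.
        var buffer = new MemoryStream();
        input.CopyTo(buffer);
        byte[] bytes = buffer.ToArray();

        // Zip local file header signature: "PK" 0x03 0x04.
        if (bytes.Length >= 4 && bytes[0] == 0x50 && bytes[1] == 0x4B
            && bytes[2] == 0x03 && bytes[3] == 0x04)
        {
            return "zip";
        }

        // Gzip magic bytes: 0x1F 0x8B. The depth cap guards against deep nesting.
        if (depth < 4 && bytes.Length >= 2 && bytes[0] == 0x1F && bytes[1] == 0x8B)
        {
            using (var gzip = new GZipStream(new MemoryStream(bytes), CompressionMode.Decompress))
            {
                return "gzip > " + Describe(gzip, depth + 1);
            }
        }

        return "unknown";
    }
}
```

For a gzipped workbook such as Book1.xlsx.gz, NestedFormatSniffer.Describe(File.OpenRead(path)) would be expected to report a zip nested inside gzip. Corrupt gzip data is not handled and would surface as an exception from GZipStream.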