adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.
MIT License

ZipReader/StreamingZipReaderFactory fails for archive entries which are uncompressed files in ZIP format #86

Open elgonzo opened 9 years ago

elgonzo commented 9 years ago

The issue has already been documented on CodePlex (see here). I added my comments and an example ZIP file exhibiting the problem to that CodePlex issue, to keep the information in one place.

A short summary: If a ZIP archive contains an uncompressed file entry which happens to be a file in the ZIP format (such as another ZIP archive, an epub, docx, or xlsx document, etc.), then trying to extract the ZIP archive using ReaderFactory.Open(...) and IReader.OpenEntryStream(...) will fail.

I am using the SharpCompress version 0.11.1 available on nuget.
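
A minimal sketch of the failing pattern, assuming the current SharpCompress.Readers API (namespaces differ in older releases such as 0.11.1; the file names are placeholders):

```csharp
using System;
using System.IO;
using SharpCompress.Readers;

class Repro
{
    static void Main()
    {
        // outer.zip is assumed to contain an uncompressed (stored) entry
        // that is itself a ZIP file (e.g. another .zip, .epub, .docx, .xlsx).
        using (Stream stream = File.OpenRead("outer.zip"))
        using (IReader reader = ReaderFactory.Open(stream))
        {
            while (reader.MoveToNextEntry())
            {
                Console.WriteLine(reader.Entry.Key);
                if (reader.Entry.IsDirectory)
                {
                    continue;
                }

                // Extracting the stored nested-ZIP entry is where the failure shows up.
                using (Stream entryStream = reader.OpenEntryStream())
                using (Stream target = File.OpenWrite(Path.GetFileName(reader.Entry.Key)))
                {
                    entryStream.CopyTo(target);
                }
            }
        }
    }
}
```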

adamhathcock commented 9 years ago

I had totally forgotten about that issue as it wasn't on GitHub. I'll look into it when I get some free time.

adamhathcock commented 9 years ago

I looked at this today and this is the basic cause: https://github.com/adamhathcock/sharpcompress/blob/master/SharpCompress/Common/Zip/StreamingZipHeaderFactory.cs#L50

However, that code was added to fix an issue with zero-byte files in a zip file. I'm not sure I like that fix anyway, as it buffers in memory.

Need to think more.

matt-richardson commented 7 years ago

Have you had a chance to think about the solution to this issue, @adamhathcock? We've got a customer (of Octopus Deploy) hitting extraction issues with a nested zip file.

The sample file they provided is available at http://help.octopusdeploy.com/discussions/problems/48574, if you want to take a look.

Thanks in advance, and thanks for a great library!

adamhathcock commented 7 years ago

The basic issue is that in the streaming Reader case, it's trying to read all entries. To do that, it scans the file for something that matches a ZIP file header.

There are possibly two ZIP headers to read: the primary header that comes before the data (which may not carry a length if the entry was written using a POST header) and the POST header itself. POST headers are created when ZIPs are written in a streaming fashion, to avoid having to know the entire file size up front.

If the entry is being skipped and I don't know its length (because the primary header has no length), then I have to scan bytes until the next header is found. Since there's no compression, that scan can land on a nested ZIP's headers.

I'm not sure this is a solvable problem, because in the POST-header no-compression scenario a ZipReader never knows the current entry's length, so it can't tell that a header it finds belongs to a nested ZIP.
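
To make the scanning problem concrete: a stored (uncompressed) nested ZIP contains the ZIP local-file-header signature verbatim in its payload, so a forward-only scan cannot distinguish it from a real entry boundary. A rough illustration of the check involved (not SharpCompress's actual code):

```csharp
// "PK\x03\x04" read as a little-endian 32-bit value.
const uint LocalFileHeaderSignature = 0x04034b50;

// When an entry's length is unknown (streamed ZIP relying on a POST header),
// a forward-only reader can only advance until it sees something that looks
// like the next header. The local file headers inside an uncompressed nested
// ZIP pass this exact test, so they are mistaken for entries of the outer archive.
static bool LooksLikeEntryStart(uint word) => word == LocalFileHeaderSignature;
```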

Two fixes for this:

ZipArchive uses the trailing Zip directory and should know the file lengths always. If this isn't true then this is an issue that can be fixed (in theory).

As for myself, I'm not sure when I'll have the time to fix this.

matt-richardson commented 7 years ago

Thanks for the detailed info, @adamhathcock.

Turns out we didn't need to use the streaming ZipReader - ZipArchive works well for us.
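
For anyone else landing here, a minimal sketch of that workaround, assuming a seekable file or stream and the current SharpCompress.Archives.Zip API (file names are placeholders):

```csharp
using System.IO;
using SharpCompress.Archives.Zip;

using (var archive = ZipArchive.Open("outer.zip"))
{
    foreach (var entry in archive.Entries)
    {
        if (entry.IsDirectory)
        {
            continue;
        }

        // Entry lengths come from the trailing central directory, so the
        // stored nested ZIP is returned as a single entry instead of being
        // re-scanned for headers.
        using (var entryStream = entry.OpenEntryStream())
        using (var target = File.OpenWrite(Path.GetFileName(entry.Key)))
        {
            entryStream.CopyTo(target);
        }
    }
}
```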

SmilingCatEntertainment commented 6 years ago

Any progress on solving this? I am also encountering this in common code used to extract 7z, rar, and zip files. If I don't use ZipReader, I get terrible performance on extracting 7z files. If I do use ZipReader, I hit this bug on zip files.

If I special-case the code to implement a workaround and handle zip and 7z differently, that defeats a lot of the purpose of choosing this library - being able to write code once to handle all 3 archive types.

adamhathcock commented 6 years ago

As explained above, this isn’t fixable as it’s not really a bug.

When creating streaming zips with any library, you don’t know the length for the header. Therefore, when extracting again with ZipReader, you can’t skip with the length. To find entries, I have to scan for headers which will find the nested archives.

I’m open to suggestions.

In your case, I don't know what you mean about 7z files, because you can never access them in a forward-only way, which means there's no 7zReader, only 7zArchive with random access.

SmilingCatEntertainment commented 6 years ago

For the 7z issue, I was referring to the situation described by this SO and this post, where the issue could be fixed by using ExtractAllEntries(), which results in a reader being created. Perhaps I misspoke about the exact terminology, sorry.

Unfortunately, using ExtractAllEntries() on certain zip files in the wild results in the issue described here. This means that for my program to function correctly, I need to handle 7z files with ExtractAllEntries(), and zip files by iterating the Entries collection directly.

I'm not enough of an expert on archive file formats to be able to offer any advice, hence I am relying on your expertise. It's a shame this can't be addressed somehow, because other than this issue I've been very happy with this library for the last couple of years, especially with its flexibility in handling multiple archive formats.
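
A rough sketch of the special-casing described above, assuming the current SharpCompress.Archives API (ArchiveFactory, SevenZipArchive, ExtractAllEntries(); the input path is a placeholder):

```csharp
using System.IO;
using SharpCompress.Archives;
using SharpCompress.Archives.SevenZip;

static void SaveEntry(string key, Stream source)
{
    using var target = File.OpenWrite(Path.GetFileName(key));
    source.CopyTo(target);
}

using (var archive = ArchiveFactory.Open("input.7z"))  // or a .zip/.rar path
{
    if (archive is SevenZipArchive)
    {
        // Solid 7z: use a forward-only reader so the stream is decompressed once.
        using var reader = archive.ExtractAllEntries();
        while (reader.MoveToNextEntry())
        {
            if (reader.Entry.IsDirectory) continue;
            using var entryStream = reader.OpenEntryStream();
            SaveEntry(reader.Entry.Key, entryStream);
        }
    }
    else
    {
        // ZIP (and RAR): random access via the archive's directory avoids
        // the nested-ZIP scanning problem from this issue.
        foreach (var entry in archive.Entries)
        {
            if (entry.IsDirectory) continue;
            using var entryStream = entry.OpenEntryStream();
            SaveEntry(entry.Key, entryStream);
        }
    }
}
```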

adamhathcock commented 6 years ago

7z is dumb: in your case, it's compressing the headers alongside the files into a single "entry" to achieve something more efficient. (RAR does this as SOLID.) This can be a good thing for many small files. However, it's not good if you're just trying to extract a single file, because it requires you to read and decompress everything.

Honestly, if you just used Zip (non-streaming) with LZMA 1 or 2, you'd get something similar and better for random access.

LZMA is what achieves the good compression, not the 7Z format itself. 7Z is a painful format to deal with. LZip with Tar is better and more straightforward as a streamed file.
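
If you do control how the archives are created, a non-streamed ZIP with LZMA might be written along these lines (a sketch only; it assumes your SharpCompress version's ZipWriter accepts CompressionType.LZMA, and the file names are placeholders):

```csharp
using System.IO;
using SharpCompress.Common;
using SharpCompress.Writers;

using (var output = File.OpenWrite("out.zip"))
using (var writer = WriterFactory.Open(output, ArchiveType.Zip,
                                       new WriterOptions(CompressionType.LZMA)))
using (var input = File.OpenRead("payload.bin"))
{
    // With a seekable FileStream the writer can go back and fill in entry sizes
    // instead of relying on trailing POST headers (behavior may vary by version).
    writer.Write("payload.bin", input);
}
```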

Sorry to continue this discussion on this issue. Please create a new issue about understanding 7Z better or something.

SmilingCatEntertainment commented 6 years ago

I realize that both issues could be resolved if I had the luxury of creating the input archives myself. However, in my case, it needs to be able to handle archives created in the wild. 7z can die a fiery death for all I care, but people insist on using it to distribute stuff. Shutting up about 7z now as it is OT except as to how it relates to how I happened across the zip issue.

andewall1175 commented 5 years ago

Hi!

This seems to be related to an issue I am facing. How to reproduce:

  1. Create a zip-archive, containing at least one file. Let it be uncompressed.
  2. Create another uncompressed zip-archive, and add the zip-archive created in step 1 to the new zip-archive.
  3. Now open a nonseekable stream towards the zip-archive, so you must use ZipReader (meaning getting entries with ZipArchive is not an alternative).
  4. Use ReaderFactory.Open to open the zip-archive created in step 2.
  5. Use the resulting IReader to call MoveToNextEntry() in a loop in order to go through all entries and add them to a List.
  6. Check the Count property on the List - it will now include the entries from both zip-archives.
  7. Now perform steps 1-6 but use LZMA as the compression method for the ZIP-archive containing the inner ZIP-archive. The List will now contain one entry, the inner archive.

Workaround when using SharpCompress: filter on Reader.Entry.Key using a Regex and only add the appropriate entries to the List. However, this assumes there is actually a file-naming standard, which might not be the case; there is then seemingly no way of telling whether an entry comes from the outer or the inner zip-archive.
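
Roughly, that filtering workaround might look like this (the regex is only an example of a naming convention and would need to match your own; the class, method, and parameter names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using SharpCompress.Readers;

static class EntryFilterExample
{
    // Example pattern only: keep entries with no directory separators and a
    // ".xml" extension; adjust to whatever naming standard the outer archive uses.
    static readonly Regex Expected = new Regex(@"^[^/\\]+\.xml$", RegexOptions.IgnoreCase);

    public static List<string> ListTopLevelKeys(Stream nonSeekableStream)
    {
        var keys = new List<string>();
        using (var reader = ReaderFactory.Open(nonSeekableStream))
        {
            while (reader.MoveToNextEntry())
            {
                if (!reader.Entry.IsDirectory && Expected.IsMatch(reader.Entry.Key))
                {
                    keys.Add(reader.Entry.Key);
                }
            }
        }
        return keys;
    }
}
```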

As my software is getting ZIP-archives from an external system, I can never tell if the archive is going to use LZMA, Deflate or be uncompressed. One way would have been to just use LZMA, but this is not possible in my case.

Would it be possible to tell MoveToNextEntry to only include entries down to a certain sub-level: 0 meaning only top-level entries, 1 going one level deeper, and so on?

adamhathcock commented 5 years ago

The basic issue is still:

When creating streaming zips with any library, you don’t know the length for the entry in the starting header (it's in the tail header). Therefore, when extracting again with ZipReader, you can’t skip with the length. To find entries, I have to scan for headers which will find the nested archives.

Erior commented 4 years ago

This fixes the issue we had, at least, and does not break any tests:

StreamingZipHeaderFactory.cs

```csharp
//entry could be zero bytes so we need to know that.
if (header.ZipHeaderType == ZipHeaderType.LocalEntry)
{
    // Do we have a size?
    if (((LocalEntryHeader)header).CompressedSize > 0)
    {
        header.HasData = true;
    }
    else // Try to detect the next header
    {
        bool isRecording = rewindableStream.IsRecording;
        if (!isRecording)
        {
            rewindableStream.StartRecording();
        }
        uint nextHeaderBytes = reader.ReadUInt32();
        header.HasData = !IsHeader(nextHeaderBytes);
        rewindableStream.Rewind(!isRecording);
    }
}
```

adamhathcock commented 4 years ago

Any chance for a pull request?

Erior commented 4 years ago

Pull request #500 created. There's a first time for everything; I hope I didn't mess things up too much.

adamhathcock commented 4 years ago

Thanks!