elgonzo opened this issue 9 years ago
I totally have forgotten about that issue as it wasn't on github. I'll look into it when I get some free time.
I looked at this today and this is the basic cause: https://github.com/adamhathcock/sharpcompress/blob/master/SharpCompress/Common/Zip/StreamingZipHeaderFactory.cs#L50
However, this was to fix an issue with zero byte files in a zip file. I'm not sure I like that fix anyway as it buffers in memory.
Need to think more.
Have you had a chance to think about the solution to this issue, @adamhathcock? We've got a customer (of Octopus Deploy) hitting extraction issues with a nested zip file.
The sample file they provided is available at http://help.octopusdeploy.com/discussions/problems/48574, if you want to take a look.
Thanks in advance, and thanks for a great library!
The basic issue is that for the Streaming Reader case, it's trying to read all entries. To do that, it's scanning the file for something that matches a ZIP file header.
There are potentially two ZIP headers to read per entry: the primary (local) header that precedes the data, which may not contain a length if the entry was written in streaming mode, and the POST header that follows the data. POST headers are written when ZIPs are created with a streaming write, to avoid having to know each entry's size up front.
If the entry is being skipped and I don't know its length (because the primary header has no length), I have to scan bytes until the next header signature is found. If the entry is uncompressed, that scan can land on the header of a nested ZIP.
I'm not sure this is a solvable problem, because in the POST-header, no-compression scenario, a ZipReader never knows the current entry's length. So it can't tell that a signature belongs to a nested ZIP rather than to the next entry.
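To make the scanning problem concrete, here is a self-contained sketch (BCL only, not SharpCompress internals) of the local header a streaming writer emits: general-purpose flag bit 3 set, and zeroed size fields, which is exactly why a forward-only reader cannot skip the entry and must scan for the next signature.

```csharp
using System;
using System.IO;

// Sketch of a ZIP local file header as written in streaming mode: flag bit 3
// ("data descriptor follows") is set, so CRC and sizes are zero here and only
// appear in the POST header after the entry data.
public class LocalHeaderDemo
{
    public const uint LocalFileHeaderSignature = 0x04034b50;

    // Builds a minimal local header the way a streaming writer would.
    public static byte[] BuildStreamedLocalHeader()
    {
        var ms = new MemoryStream();
        var w = new BinaryWriter(ms);
        w.Write(LocalFileHeaderSignature);
        w.Write((ushort)20);      // version needed to extract
        w.Write((ushort)0x0008);  // flags: bit 3 => sizes deferred to data descriptor
        w.Write((ushort)0);       // method 0: stored (no compression)
        w.Write((ushort)0);       // mod time
        w.Write((ushort)0);       // mod date
        w.Write(0u);              // CRC-32 (not known yet)
        w.Write(0u);              // compressed size (not known yet)
        w.Write(0u);              // uncompressed size (not known yet)
        w.Write((ushort)0);       // file name length
        w.Write((ushort)0);       // extra field length
        return ms.ToArray();
    }

    // Returns (sizeDeferred, compressedSize) as a reader would see them.
    public static (bool, uint) ParseLocalHeader(byte[] header)
    {
        var r = new BinaryReader(new MemoryStream(header));
        r.ReadUInt32();                 // signature
        r.ReadUInt16();                 // version
        ushort flags = r.ReadUInt16();
        r.ReadUInt16();                 // method
        r.ReadUInt32();                 // time + date
        r.ReadUInt32();                 // crc
        uint compressedSize = r.ReadUInt32();
        return ((flags & 0x0008) != 0, compressedSize);
    }

    public static void Main()
    {
        var (deferred, size) = ParseLocalHeader(BuildStreamedLocalHeader());
        Console.WriteLine($"size deferred: {deferred}, header size field: {size}");
        // With a zero size field and stored data, the only way forward is to
        // scan for the next signature -- which can match a nested ZIP's header.
    }
}
```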
Two fixes for this:
ZipArchive uses the trailing ZIP central directory, so it should always know the file lengths. If that isn't true, then it's an issue that can be fixed (in theory).
I'm not sure when I'll have the time to fix this myself.
Thanks for the detailed info, @adamhathcock.
Turns out we didn't need to use the streaming ZipReader - ZipArchive works well for us.
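Since ZipArchive reads the trailing central directory, which always records entry lengths, switching to it sidesteps the header scan entirely. A hedged sketch of that approach follows; the SharpCompress API names are from memory and may differ slightly between versions:

```csharp
using System.IO;
using SharpCompress.Archives.Zip;

class ArchiveExtraction
{
    static void ExtractAll(string zipPath, string outDir)
    {
        // ZipArchive needs a seekable source: it reads the central directory,
        // so entry lengths are always known and an uncompressed nested zip is
        // treated as plain data rather than rescanned for headers.
        using (var archive = ZipArchive.Open(zipPath))
        {
            foreach (var entry in archive.Entries)
            {
                if (entry.IsDirectory) continue;
                var target = Path.Combine(outDir, entry.Key);
                Directory.CreateDirectory(Path.GetDirectoryName(target));
                using (var input = entry.OpenEntryStream())
                using (var output = File.Create(target))
                    input.CopyTo(output);
            }
        }
    }
}
```

The trade-off is that this requires a seekable stream, so it won't work on a pure network stream the way ZipReader does.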
Any progress on solving this? I am also encountering this in common code used to extract 7z, rar, and zip files. If I don't use ZipReader, I get terrible performance on extracting 7z files. If I do use ZipReader, I hit this bug on zip files.
If I special-case the code to implement a workaround and handle zip and 7z differently, that defeats a lot of the purpose of choosing this library - being able to write code once to handle all 3 archive types.
As explained above, this isn’t fixable as it’s not really a bug.
When creating streaming zips with any library, the entry length isn't known at the time the header is written. Therefore, when extracting again with ZipReader, you can't skip by length. To find entries, I have to scan for headers, and that scan will find the nested archives.
I’m open to suggestions.
In your case, I don't know what you mean by 7z files, because you can't ever access them in a forward-only way, which means there's no 7zReader - only 7zArchive with random access.
For the 7z issue, I was referring to the situation described by this SO and this post, where the issue could be fixed by using ExtractAllEntries(), which results in a reader being created. Perhaps I misspoke about the exact terminology, sorry.
Unfortunately, using ExtractAllEntries() on certain zip files in the wild results in the issue described here. This means that for my program to function correctly, I need to handle 7z files with ExtractAllEntries(), and zip files by iterating the Entries collection directly.
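The special-casing being described might look roughly like this. This is a hedged sketch: the SharpCompress names (ArchiveFactory, ExtractAllEntries, the WriteTo* extension methods) are used as I remember them, so check them against your version:

```csharp
using SharpCompress.Archives;
using SharpCompress.Common;
using SharpCompress.Readers;

class FormatAwareExtraction
{
    static void Extract(string path, string outDir)
    {
        using (var archive = ArchiveFactory.Open(path))
        {
            if (archive.Type == ArchiveType.SevenZip)
            {
                // Solid 7z: one forward pass via ExtractAllEntries()
                // decompresses the solid block only once.
                using (var reader = archive.ExtractAllEntries())
                {
                    while (reader.MoveToNextEntry())
                        if (!reader.Entry.IsDirectory)
                            reader.WriteEntryToDirectory(outDir);
                }
            }
            else
            {
                // Zip: iterate Entries (central directory), avoiding the
                // streaming header scan that trips over nested zips.
                foreach (var entry in archive.Entries)
                    if (!entry.IsDirectory)
                        entry.WriteToDirectory(outDir);
            }
        }
    }
}
```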
I'm not enough of an expert on archive file formats to be able to offer any advice, hence I am relying on your expertise. It's a shame this can't be addressed somehow, because other than this issue I've been very happy with this library for the last couple of years, especially with its flexibility in handling multiple archive formats.
7z is dumb: in your case, it's compressing the headers alongside the files into a single "entry" to achieve a better ratio (RAR does this as SOLID). This can be a good thing for many small files. However, it's not good if you're just trying to extract a single file, because it requires you to read and decompress everything.
Honestly, if you just used Zip (non-streaming) with LZMA 1 or 2, you'd get something similar and better for random access.
LZMA is what achieves the good compression, not the 7z format itself. It's a painful format to deal with. LZip with Tar is better and more straightforward as a streamed file.
Sorry to continue this discussion on this issue. Please create a new issue about understanding 7Z better or something.
I realize that both issues could be resolved if I had the luxury of creating the input archives myself. However, in my case, it needs to be able to handle archives created in the wild. 7z can die a fiery death for all I care, but people insist on using it to distribute stuff. Shutting up about 7z now as it is OT except as to how it relates to how I happened across the zip issue.
Hi!
This seems to be related to an issue I am facing. How to reproduce:
Workaround when using SharpCompress: filter on Reader.Entry.Key using a Regex and only add the matching entries to the list. However, this assumes there is actually a file naming standard, which might not be the case - and that is now a problem, as there is seemingly no way of telling whether an entry comes from the outer or the inner zip archive.
As my software is getting ZIP-archives from an external system, I can never tell if the archive is going to use LZMA, Deflate or be uncompressed. One way would have been to just use LZMA, but this is not possible in my case.
Would it be possible to tell MoveToNextEntry to include files to a certain sub-level, 0 meaning only go through top-level entries, 1 would be going one step further, and so on?
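For reference, the Key-filtering workaround mentioned above might look like this. The Regex pattern is purely hypothetical and stands in for whatever naming standard the outer archive happens to follow:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using SharpCompress.Readers;

class FilteredListing
{
    static List<string> ListExpectedEntries(Stream zipStream)
    {
        // Hypothetical naming standard for top-level entries; entries the
        // reader scans out of a nested uncompressed zip won't match and are
        // skipped. Fails silently if the standard isn't followed.
        var expected = new Regex(@"^data/[\w-]+\.xml$");
        var keys = new List<string>();
        using (var reader = ReaderFactory.Open(zipStream))
        {
            while (reader.MoveToNextEntry())
            {
                if (!reader.Entry.IsDirectory && expected.IsMatch(reader.Entry.Key))
                    keys.Add(reader.Entry.Key);
            }
        }
        return keys;
    }
}
```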
The basic issue is still:
When creating streaming zips with any library, you don't know the entry's length when its starting header is written (it only appears in the tail header). Therefore, when extracting again with ZipReader, you can't skip by length. To find entries, I have to scan for headers, and that scan will find the nested archives.
This fixes the issue we had, at least, and does not break any tests:
StreamingZipHeaderFactory.cs
// Entry could be zero bytes, so we need to know that.
if (header.ZipHeaderType == ZipHeaderType.LocalEntry)
{
    // Do we have a size?
    if (((LocalEntryHeader)header).CompressedSize > 0)
    {
        header.HasData = true;
    }
    else // Try to detect the next header
    {
        bool isRecording = rewindableStream.IsRecording;
        if (!isRecording)
        {
            rewindableStream.StartRecording();
        }
        uint nextHeaderBytes = reader.ReadUInt32();
        header.HasData = !IsHeader(nextHeaderBytes);
        rewindableStream.Rewind(!isRecording);
    }
}
Any chance for a pull request?
Pull request #500 created - first time for everything; I hope I didn't mess things up too much.
Thanks!
The issue has already been documented on CodePlex (see here). I added my comments and an example ZIP file exhibiting the problem to that CodePlex issue, to keep the information in one place.
A short summary: if a ZIP archive contains an uncompressed file entry which happens to be a file in the ZIP format (such as another ZIP archive, or an epub, docx, or xlsx document), then trying to extract the ZIP archive using ReaderFactory.Open(...) and IReader.OpenEntryStream(...) will fail.
I am using the SharpCompress version 0.11.1 available on nuget.
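The trigger condition can be reproduced with nothing but the BCL: build an outer zip whose only entry is an inner zip written with no compression, so the inner zip's header bytes appear verbatim in the outer file. Scanning the outer bytes then finds more than one local-file-header signature, which is what a streaming reader's header scan runs into. A self-contained sketch:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Builds a zip-inside-a-zip with the inner entry uncompressed, then counts
// how many local file header signatures (PK\x03\x04) appear in the raw bytes.
public class NestedZipRepro
{
    public static byte[] BuildOuterZip()
    {
        // Inner zip with one small entry.
        byte[] innerBytes;
        using (var inner = new MemoryStream())
        {
            using (var z = new ZipArchive(inner, ZipArchiveMode.Create, leaveOpen: true))
            using (var s = new StreamWriter(z.CreateEntry("inner.txt").Open()))
                s.Write("hello");
            innerBytes = inner.ToArray();
        }

        // Outer zip carrying the inner zip with no compression, so the inner
        // bytes (including its PK signatures) survive verbatim.
        using (var outer = new MemoryStream())
        {
            using (var z = new ZipArchive(outer, ZipArchiveMode.Create, leaveOpen: true))
            using (var es = z.CreateEntry("nested.zip", CompressionLevel.NoCompression).Open())
                es.Write(innerBytes, 0, innerBytes.Length);
            return outer.ToArray();
        }
    }

    // Counts occurrences of the local file header signature PK\x03\x04.
    public static int CountLocalHeaderSignatures(byte[] data)
    {
        int count = 0;
        for (int i = 0; i + 3 < data.Length; i++)
            if (data[i] == 0x50 && data[i + 1] == 0x4B &&
                data[i + 2] == 0x03 && data[i + 3] == 0x04)
                count++;
        return count;
    }

    public static void Main()
    {
        int n = CountLocalHeaderSignatures(BuildOuterZip());
        // Expect at least two: the outer zip's own header plus the nested one.
        Console.WriteLine($"local header signatures found: {n}");
    }
}
```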