adamhathcock / sharpcompress

SharpCompress is a fully managed C# library to deal with many compression types and formats.
MIT License

ZIP archive file entries with a "data descriptor structure" will confuse ZipReader #88

Open elgonzo opened 9 years ago

elgonzo commented 9 years ago

When a ZIP archive file entry has a data descriptor structure following its compressed file data, ZipReader will incorrectly report the CRC and file size of this entry as zero. In itself this is more an inconvenience than an error when "streaming" a ZIP archive. Note that it is still possible to obtain the decompressed file data of such a file entry by reading its EntryStream (ZipReader.OpenEntryStream()) until the end of the EntryStream.

However, calling ZipReader.MoveToNextEntry() without reading the EntryStream of such a file entry will upset the ZipReader and make it seek to some arbitrary position in the ZIP file. It will read 4 bytes at this position, expecting to find a local file header signature (i guess). Since those 4 bytes at this arbitrary file position will not be a valid signature, the ZipHeaderFactory.ReadHeader(...) method will throw a NotSupportedException telling: "Unknown header: <random number>".

I have seen a few reports about NotSupportedExceptions telling "Unknown header: <some random number>". Although i cannot be sure what caused the NotSupportedExceptions in those cases, it is certainly a possibility that they might have been caused by the problem i explain here.

What i believe ZipReader should do:

ZipReader can check whether the "Crc-32" and "Compressed size" fields are zero. If that is the case and this file entry should be skipped (instead of being extracted), then ZipReader could (A) check the compression mode and/or whether a signature immediately follows this file entry -- which would indicate a zero-byte file. If the entry has not been identified as a zero-byte file, then (B) ZipReader can attempt decompressing the file data in memory to get to the end of the compressed data, thus reaching the optional data descriptor of this entry or the local file header of the next archive entry.
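
Step (B) can be sketched in a few lines. The following is a Python illustration of the idea only, not SharpCompress code; `find_deflate_end` and the sample buffer are hypothetical names made up for this sketch, and it assumes the entry uses raw deflate (method 8). The inflater is fed chunk by chunk until it reports the end of the stream; whatever it read past that point is the data descriptor or the next header.

```python
import zlib


def find_deflate_end(buf: bytes, start: int, chunk_size: int = 4096):
    """Return (end_offset, uncompressed_size, crc32) for the raw deflate
    stream that begins at buf[start]. Hypothetical helper for illustration."""
    inflater = zlib.decompressobj(-15)  # negative wbits: raw deflate, no zlib header
    pos = start
    size = 0
    crc = 0
    while not inflater.eof:
        chunk = buf[pos:pos + chunk_size]
        if not chunk:
            raise EOFError("deflate stream is truncated")
        pos += len(chunk)
        data = inflater.decompress(chunk)
        size += len(data)
        crc = zlib.crc32(data, crc)
    # Bytes that were fed in but lie beyond the deflate stream end up in
    # unused_data, so subtracting them gives the exact end of the entry data.
    end_offset = pos - len(inflater.unused_data)
    return end_offset, size, crc


# Demo data (assumption: stands in for one entry's compressed bytes plus
# whatever follows it in the archive).
payload = b"hello data descriptor " * 50
deflater = zlib.compressobj(6, zlib.DEFLATED, -15)
compressed = deflater.compress(payload) + deflater.flush()
buf = compressed + b"PK\x07\x08" + b"\x00" * 12

end, size, crc = find_deflate_end(buf, 0)
print(end == len(compressed), size == len(payload), crc == zlib.crc32(payload))
```

On a non-seekable stream the over-read bytes cannot be skipped back over, so a reader would have to keep `unused_data` around and parse the next structure from it before reading further.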

Background info:

The ".ZIP File Format Specification" contains more information with regard to data descriptor structures. Especially the following chapters are worth a read:

4.3.7 Local file header
4.3.9 Data descriptor
4.4.4 General purpose bit flag, Bit 3
4.4.7 CRC-32
4.4.8 compressed size
4.4.9 uncompressed size

Link to the ".ZIP File Format Specification": https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Also pay attention to the paragraphs 4.3.5 (a data descriptor structure may be present even if general purpose bit 3 is not set), 4.3.9.3 (optional data descriptor signature 0x08074b50) and 4.3.9.6 (data descriptor and central directory encryption).
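
Because the signature is optional (4.3.9.3), a reader has to accept both the 16-byte and the 12-byte form of the descriptor. A minimal Python sketch of that check, for illustration only: `read_data_descriptor` is a hypothetical helper, the sample values are made up, and ZIP64 descriptors (8-byte sizes) and encryption are deliberately ignored.

```python
import struct

DD_SIGNATURE = 0x08074B50  # optional data descriptor signature


def read_data_descriptor(buf: bytes, offset: int):
    """Parse a non-ZIP64 data descriptor at buf[offset:].

    Returns (crc32, compressed_size, uncompressed_size, descriptor_length).
    """
    (first,) = struct.unpack_from("<I", buf, offset)
    if first == DD_SIGNATURE:
        # Signature present: the three fields follow it.
        crc, csize, usize = struct.unpack_from("<III", buf, offset + 4)
        return crc, csize, usize, 16
    # No signature: the descriptor starts directly with the CRC-32.
    crc, csize, usize = struct.unpack_from("<III", buf, offset)
    return crc, csize, usize, 12


# Made-up sample values, packed in both forms.
with_sig = struct.pack("<IIII", DD_SIGNATURE, 0x12345678, 9, 7)
without_sig = struct.pack("<III", 0x12345678, 9, 7)

print(read_data_descriptor(with_sig, 0))
print(read_data_descriptor(without_sig, 0))
```

One caveat: a CRC-32 that happens to equal 0x08074b50 is indistinguishable from the signature here. A more careful reader could cross-check the parsed values against the CRC and sizes it computed while decompressing.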

Remarks:

ZipArchive and its ZipArchive.Entries enumeration do not seem to be affected by this issue.

The ZIP archive i found to have file entries as described above is about 300MB in size. This is obviously too large to upload as a sample. I will provide a small ZIP file with the described file entries as soon as i manage to produce one myself ;)

elgonzo commented 9 years ago

Okay, i managed to produce a small ZIP archive. It can be downloaded from:

URL: http://wikisend.com/download/589558/test_data_descriptor.zip
Password: adamhathcock

The ZIP file contains two file entries: "-" and "second.txt".

(I used Info-ZIP's command line utility zip.exe 3.0 to create this file, which explains why the file name of one of the file entries is just a dash...)

Look at the local file header of the first file ("-"). It has the general purpose bit 3 set, and the fields "crc-32", "compressed size" and "uncompressed size" are zeroed. As required by the general purpose bit 3 being set, this file entry has a data descriptor following its file data.

Also interesting is the local file header of the second file entry "second.txt". It has the general purpose bit 3 set too and has a data descriptor as well, but notice that only the fields "crc-32" and "compressed size" are zeroed, whereas the field "uncompressed size" is not zero (it contains the actual correct uncompressed file size for this entry). If the ZIP file format specification is followed to the letter, then this local file header is actually violating the specification. One has to assume that Info-ZIP is not the only software which could create such local file headers...
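
To inspect these fields without a hex editor, the fixed 30-byte part of a local file header (4.3.7) can be unpacked directly. The sketch below is Python for illustration; `parse_local_header` and `LocalHeader` are hypothetical names, and the sample header is built by hand to mimic what is described for "second.txt" (the uncompressed size of 26 is a made-up value, not taken from the real archive).

```python
import struct
from typing import NamedTuple

LOCAL_HEADER_SIGNATURE = 0x04034B50
FLAG_DATA_DESCRIPTOR = 0x0008  # general purpose bit 3


class LocalHeader(NamedTuple):
    flags: int
    method: int
    crc32: int
    compressed_size: int
    uncompressed_size: int
    name: str
    data_offset: int  # where the entry's file data starts


def parse_local_header(buf: bytes, offset: int = 0) -> LocalHeader:
    (sig, _version, flags, method, _time, _date,
     crc, csize, usize, name_len, extra_len) = struct.unpack_from(
        "<IHHHHHIIIHH", buf, offset)
    if sig != LOCAL_HEADER_SIGNATURE:
        raise ValueError(f"Unknown header: {sig}")
    name_start = offset + 30
    name = buf[name_start:name_start + name_len].decode("cp437")
    return LocalHeader(flags, method, crc, csize, usize, name,
                       name_start + name_len + extra_len)


# Hand-built header: bit 3 set, crc-32 and compressed size zeroed,
# uncompressed size filled in (hypothetical value).
name = b"second.txt"
sample = struct.pack("<IHHHHHIIIHH", LOCAL_HEADER_SIGNATURE, 20,
                     FLAG_DATA_DESCRIPTOR, 8, 0, 0, 0, 0, 26,
                     len(name), 0) + name

header = parse_local_header(sample)
print(header.name, bool(header.flags & FLAG_DATA_DESCRIPTOR),
      header.crc32, header.compressed_size, header.uncompressed_size)
```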

Note that this small ZIP file will not produce a NotSupportedException as described in my report above, but rather an EndOfStreamException. I guess the arbitrary stream position the ZipReader wants to jump to after getting confused is beyond the end of the zip archive file, which would explain the different exception i observed when testing ZipReader with this small ZIP archive.

Some boring tidbits about how i created the ZIP file

There are basically two ways to create file entries with data descriptors using Info-ZIP's zip utility.

The first way is to use the "-fd" command line switch, which enforces data descriptors and sets the general purpose bit 3 on the affected archive entries. I used this switch to add "second.txt" to the archive. However, as i explained, the Info-ZIP zip utility forgets to set the "uncompressed size" field to zero. And i wanted to get an archive entry where "uncompressed size" is properly set to zero.

The other way is to provide the data to be compressed via stdin. In this case, Info-ZIP's zip utility will also use a data descriptor and set the general purpose bit 3 for the resulting archive entry. It will also properly zero out "crc-32", "compressed size" as well as "uncompressed size".

Hence, i used the following command line to create the small test ZIP file:

type first.txt | zip -fd -fz- test_data_descriptor.zip - second.txt
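
Where Info-ZIP is not at hand, a comparable test archive can probably be produced with Python's standard zipfile module. When zipfile writes to a stream that offers neither tell() nor seek(), current CPython versions set general purpose bit 3, zero the three fields in the local header and append a data descriptor after each entry. This is a sketch under that assumption; `WriteOnlyStream` is a hypothetical wrapper written for this example, and the entry contents are made up.

```python
import io
import struct
import zipfile


class WriteOnlyStream:
    """Hypothetical wrapper: exposes write()/flush() only, like a pipe,
    so zipfile treats the output as unseekable."""

    def __init__(self):
        self._buffer = io.BytesIO()

    def write(self, data):
        return self._buffer.write(data)

    def flush(self):
        pass

    def getvalue(self):
        return self._buffer.getvalue()


sink = WriteOnlyStream()
with zipfile.ZipFile(sink, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("first.txt", "first file\n" * 20)
    archive.writestr("second.txt", "second file\n" * 20)
zip_bytes = sink.getvalue()

# Inspect the first local file header: flags at offset 6,
# crc-32 / compressed size / uncompressed size at offset 14.
(flags,) = struct.unpack_from("<H", zip_bytes, 6)
crc, csize, usize = struct.unpack_from("<III", zip_bytes, 14)
print(f"bit 3 set: {bool(flags & 0x0008)}, crc={crc}, csize={csize}, usize={usize}")
```

Writing `zip_bytes` to a file gives an archive that can be fed to ZipReader for testing.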


General purpose bit 3 and uncompressed file entries

The ZIP file format specification says the following about the general purpose bit 3:

If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data. (Note: PKZIP version 2.04g for DOS only recognizes this bit for method 8 compression, newer versions of PKZIP recognize this bit for any compression method.)

The remark that the data descriptor has to follow the compressed data when the general purpose bit 3 is set means in consequence that setting general purpose bit 3 for uncompressed file entries is not allowed (as there would be no compressed data...).

This means that, when encountering a local file header where the "crc-32", "compressed size" and/or "uncompressed size" fields are zero, it should be sufficient to check the compression mode and the general purpose bit 3 to know whether this entry represents a zero-byte file or whether the entry's size and CRC values are to be found in a data descriptor following the compressed data...
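
Spelled out as code, that decision could look roughly like the Python sketch below. `classify_entry` and its return labels are hypothetical; the handling of a stored entry with bit 3 set follows the interpretation given above and is an assumption, not something the specification states in these words. The case from paragraph 4.3.5 (a descriptor present although bit 3 is not set) is not covered.

```python
FLAG_DATA_DESCRIPTOR = 0x0008  # general purpose bit 3
METHOD_STORED = 0


def classify_entry(flags, method, crc32, compressed_size, uncompressed_size):
    """Decide where an entry's real CRC and sizes are to be found."""
    if flags & FLAG_DATA_DESCRIPTOR:
        if method == METHOD_STORED:
            # Assumption from the discussion above: without compressed
            # data there is no way to find the end of the entry.
            return "unsupported"
        # Header fields are placeholders; real values follow the data.
        return "data-descriptor"
    if crc32 == 0 and compressed_size == 0 and uncompressed_size == 0:
        return "zero-byte"
    return "sizes-in-header"


print(classify_entry(0x0008, 8, 0, 0, 0))    # like the "-" entry
print(classify_entry(0x0008, 8, 0, 0, 26))   # like "second.txt" (made-up size)
print(classify_entry(0x0000, 0, 0, 0, 0))    # empty file
```

Note that the check keys on bit 3 first rather than on the zeroed fields, so a header like the one Info-ZIP wrote for "second.txt" is still classified correctly.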

adamhathcock commented 9 years ago

Thanks for this info. My brain is baby fried so I'll have to look at this a bit later. Just wanted you to know I'm not ignoring you.

elgonzo commented 9 years ago

Don't worry. I am not expecting you to start rushing just because i wrote something. I am sure Github will be around for quite some time and so will the stuff i wrote... :)

In case you will not find time to look at the issue in the next weeks, you might still want to grab and make a backup of that small ZIP file i mentioned in my second comment. The hosting site (Wikisend) will delete it after 90 days. (Sorry, i forgot to mention this earlier...)

mewalig commented 8 years ago

could you repost the file?

mewalig commented 8 years ago

nm, made my own thanks to your helpful notes. I'm on osx, used:

cat myfile.txt | zip -fz- target.zip -

mewalig commented 8 years ago

Shucks, I was hoping that would create a zip file with General purpose bit 3 set, but it looks like it doesn't...

elgonzo commented 8 years ago

Here is the ZIP file again: test_data_descriptor.zip

Just FYI, when creating the ZIP file i also used the command line parameter -fd, which enforces usage of data descriptors. Not sure whether the ZIP tool on OSX provides this parameter, but i noticed that you didn't use it when creating your ZIP file (which could explain why your ZIP tool chose not to use data descriptors, for whatever reasons and circumstances).