DILCISBoard / SIARD

SIARD (Software Independent Archiving of Relational Databases) - an open file format for the long-term archiving of relational databases

ZIP64 - general purpose bit #8

Open sfa-siard opened 5 years ago

sfa-siard commented 5 years ago

Hi

Although this question is not about the SIARD specification itself, it is about ZIP64 (which is used by SIARD) and the interpretation of the ZIP64 specification. As we (SFA, KOST (developer of KOST-Val) and Hartwig Thomas (developer of SIARD Suite)) have not been able to reach a common view on the issue, we would like your input.

Brief summary: During development and bug fixing, some changes were made to the Zip64 library used by SIARD Suite. These seem to be in line with the .ZIP File Format Specification [1]. However, the validation software KOST-Val now produces an error [2] of the following type:

header validation | An unknown error has occurred. invalid size (expected from local header/data descriptor 0 but actually found 15703 bytes)

We then investigated further. It seems that most ZIP applications do not have a problem with the new implementation used by SIARD Suite. The following applications were able to open/extract the ZIP files without any issues: 7-Zip, IZarc, PKZIP. However, java.util.zip is not able to unpack/open these ZIP files and throws the error seen above.

Because of this, we are not sure whether the changes made to SIARD Suite are correct. Maybe most ZIP applications are fault-tolerant here. Or maybe it is simply a bug in java.util.zip.

We would really appreciate your feedback on this.

Best, Marcel


References:
[1] - Siard Suite / Zip64File Issue: Wrong size in ZIP64 extended information extra field of local file header
[2] - KOST-Val Issue: SIARD 2.1 .ZIP File Format Specification
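One way to see why some tools accept an archive while another reports "expected ... 0 but actually found ... bytes" is to compare the size stored in the central directory with the size stored in each entry's local file header. The following is a minimal diagnostic sketch, not the check that KOST-Val or java.util.zip performs; the function name `compare_sizes` is hypothetical, and it assumes the whole archive fits in memory and does not use ZIP64 size fields.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header (appnote 4.3.7), little-endian:
# signature, version needed, flags, method, time, date,
# crc-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def compare_sizes(data: bytes):
    """Return (name, central_size, local_size, bit3_set) for each entry.

    central_size comes from the central directory (via zipfile),
    local_size is the raw uncompressed-size field of the local header.
    """
    rows = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(data, info.header_offset)
            signature, flags, local_size = fields[0], fields[2], fields[8]
            if signature != b"PK\x03\x04":
                raise ValueError(f"no local header found for {info.filename}")
            bit3_set = bool(flags & 0x0008)  # data descriptor follows the data
            rows.append((info.filename, info.file_size, local_size, bit3_set))
    return rows
```

An entry whose local size is 0 while bit 3 is set and the central directory reports a non-zero size carries its real sizes only in the data descriptor and the central directory, so a reader that trusts the local header alone would see a mismatch.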

andersbonielsen commented 5 years ago

Hello Marcel,

I will ask Miguel (@jmaferreira) from the DILCIS GitHub to let Bruno have a look at the Java implementation, especially java.util.zip. Are you using OpenJDK, Oracle Java, or both?

I will also look at it together with my colleague René with regard to the specification and the .NET implementations. I recall that we had problems with ZIP64 and certain zip libraries for .NET, both open source and closed source, when we last examined libraries and tools. Since then the appnote has also been updated.

(For preservation purposes, fault-tolerant tools are problematic in the long run, especially if they do not inform you about the error they have managed to work around. What do you do if the migration tool you have to use 20 years later only supports the strict version of the standard? We have seen this problem with TIFF and especially with PDF.)

In the meantime, could you provide us with a problematic SIARD file for testing purposes? And by the way, are we communicating in German, English, or both?

jmaferreira commented 5 years ago

I will pass this on to Bruno, but we still need to know which software produced the ZIP file.

sfa-siard commented 5 years ago

Hi

Example files can be found here: https://github.com/sfa-siard/SiardGui/blob/master/testfiles/sample.siard (e.g. sample.siard). These files were created using SIARD Suite, or more specifically its library Zip64File: https://github.com/sfa-siard/Zip64File

You can use KOST-Val (https://github.com/KOST-CECO/KOST-Val/releases/tag/v1.9.3) to reproduce the issue.

Best, Marcel

andersbonielsen commented 5 years ago

The sample file sample.siard does not seem to be a ZIP64 file. According to my hex viewer and appnote section 4.3.7 (Local file header), bytes 4 and 5 are hex 14 00, i.e. decimal 20, which represents version 2.0. According to section 4.4.3, version needed to extract (2 bytes), a ZIP64 file requires version 4.5:

4.5 - File uses ZIP64 format extensions.

Apart from this, the value of 4.4.4 general purpose bit flag (2 bytes) is binary 00000000 00001000, i.e. bit 3 is set, and therefore:

Bit 3: If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data.

[Screenshot: hex view of sample.siard]

By the way, the file is valid according to PKZIP: log for test of sample siard
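The hex-viewer reading above can be reproduced programmatically. This is a minimal sketch, assuming the first 30 bytes of the file are the fixed part of a local file header as laid out in appnote 4.3.7; the function name `describe_local_header` and the dictionary keys are hypothetical.

```python
import struct


def describe_local_header(raw: bytes) -> dict:
    """Decode the fixed 30-byte part of a ZIP local file header."""
    (signature, version_needed, flags, _method, _time, _date,
     crc32, compressed_size, uncompressed_size,
     _name_len, _extra_len) = struct.unpack_from("<4s5H3L2H", raw, 0)
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    spec_version = version_needed & 0xFF  # e.g. 20 means 2.0, 45 means 4.5
    return {
        "version_needed": f"{spec_version // 10}.{spec_version % 10}",
        # Appnote 4.4.3: 4.5 is the minimum for ZIP64 format extensions.
        "meets_zip64_minimum": spec_version >= 45,
        # Appnote 4.4.4, bit 3: sizes and CRC are in the data descriptor.
        "bit3_data_descriptor": bool(flags & 0x0008),
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
    }
```

For a header with bytes 14 00 at offset 4 and 08 00 at offset 6, this reports version 2.0, does not meet the ZIP64 minimum, and has bit 3 set, which matches the observations in this comment.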

andersbonielsen commented 5 years ago

We have not had time to investigate further yet, but I can recommend https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.html#archivedata as well as the comments from the other Buchholz (Martin) about the difficulties of interpreting the ZIP format, as this comment regarding OpenJDK shows:

Martin Buchholz added a comment - 2017-08-18 18:46

We happened to be looking at this code recently and observed it to be incorrect ... because a zip implementation can add a ZIP64 EOCD header for any reason whatsoever. You always have to look for it! But because the last entry might happen to contain data that looks like a ZIP64 EOCD, you had better do a lot of validation on it - check multiple fields (not just the signature) and if anything looks bad, fall back to assuming this was a false trail. yes, reading zip files involves heuristics! appnote.txt has the strong hint """However ZIP64 format may be used regardless of the size of a file"""
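The approach described in the quoted comment (always look for a ZIP64 EOCD, but validate several fields and fall back if anything looks wrong) can be sketched as follows. This is an illustration under simplifying assumptions, not the OpenJDK code: the function names are hypothetical, the archive is held in memory, it is a single-disk archive, and only a handful of consistency checks are made.

```python
import struct

EOCD_SIG = b"PK\x05\x06"           # end of central directory record
ZIP64_LOCATOR_SIG = b"PK\x06\x07"  # ZIP64 EOCD locator (20 bytes)
ZIP64_EOCD_SIG = b"PK\x06\x06"     # ZIP64 EOCD record (at least 56 bytes)


def find_eocd(data: bytes):
    """Return the offset of the last plausible EOCD record, or None."""
    pos = data.rfind(EOCD_SIG)
    while pos >= 0:
        if pos + 22 <= len(data):
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            # The record plus its comment must end exactly at end of file.
            if pos + 22 + comment_len == len(data):
                return pos
        pos = data.rfind(EOCD_SIG, 0, pos)
    return None


def find_zip64_eocd(data: bytes):
    """Return the offset of a validated ZIP64 EOCD record, or None."""
    eocd_pos = find_eocd(data)
    if eocd_pos is None:
        raise ValueError("no end of central directory record found")
    locator_pos = eocd_pos - 20
    if locator_pos < 0:
        return None
    sig, _disk, record_offset, total_disks = struct.unpack_from(
        "<4sLQL", data, locator_pos)
    # Check more than the signature; anything odd counts as a false trail.
    if sig != ZIP64_LOCATOR_SIG or total_disks > 1:
        return None
    if record_offset + 56 > locator_pos:
        return None
    if data[record_offset:record_offset + 4] != ZIP64_EOCD_SIG:
        return None
    (record_size,) = struct.unpack_from("<Q", data, record_offset + 4)
    # The size field excludes the 12 leading bytes (signature and size).
    if record_size < 44 or record_offset + 12 + record_size > locator_pos:
        return None
    return record_offset
```

Returning None instead of raising on a failed check mirrors the "fall back to assuming this was a false trail" advice: the caller then treats the archive as a plain (non-ZIP64) one.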

andersbonielsen commented 5 years ago

Our interest in DILCIS in the interpretation of the ZIP format is not only due to SIARD, but also because the common specification contains this CAN:

CSIPSTR3: The Information Package root folder CAN be compressed (for example by using TAR or ZIP).

If this CAN is used by the producer of an Information Package I expect it will often be ZIP64, giving us the need for validating the ZIP64 format.