AmsterdamUMC / AmsterdamUMCdb

AmsterdamUMCdb - Freely Accessible ICU database. Please access our Open Access manuscript at https://doi.org/10.1097/CCM.0000000000004916
https://amsterdammedicaldatascience.nl/
MIT License

Issues extracting zip Archive #13

Closed nbenn closed 3 years ago

nbenn commented 3 years ago

Is it just me or are other people also having problems extracting the zip archive distributed via the filesender instance at https://filesender.surf.nl? On macOS 10.15.6, using zipinfo v3.00 I get

❯ zipinfo AmsterdamUMCdb-v1.0.2.zip
Archive:  AmsterdamUMCdb-v1.0.2.zip
Zip file size: 9143127113 bytes, number of entries: 7
warning [AmsterdamUMCdb-v1.0.2.zip]:  4848159318 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [AmsterdamUMCdb-v1.0.2.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

I'm sorry if I'm reporting this issue in the wrong place and I'm happy to be redirected.

patrickthoral commented 3 years ago

The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.

nbenn commented 3 years ago

@patrickthoral Thanks for getting in touch. I don't believe the macOS Archive Utility has anything to do with this, most of all due to the fact that the issue I'm describing shows up when running zipinfo from the command line.

Furthermore, I do not think the problem is limited to macOS. I can reproduce the issue under CentOS 7, for example, again running zipinfo v3.00:

[nbennett@eu-login-18 aumc]$ zipinfo AmsterdamUMCdb-v1.0.2.zip
Archive:  AmsterdamUMCdb-v1.0.2.zip
Zip file size: 9143127113 bytes, number of entries: 7
warning [AmsterdamUMCdb-v1.0.2.zip]:  4848159318 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [AmsterdamUMCdb-v1.0.2.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

If you want me to, I can also check on Fedora, but I honestly do not believe this is an OS issue in that sense, but rather that there is an issue with how the zip file was created.

patrickthoral commented 3 years ago

I do not think there is a problem with the OS itself, but with the common source most unzip utilities are based upon. I could reproduce the same error a colleague had on a Mac. The workaround was to use Commander One or WinZip for Mac. The file was created on a Windows system with the built-in archiving tools. For the next version, we'll check if there's a format/setting that won't mess up the default archiving tools on *nix-based systems. If it won't extract at all, there is probably a transfer error.

nbenn commented 3 years ago

> The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.

For an enlightening SO post on this, see https://stackoverflow.com/a/59518097/3855417.

While you're correct that there still is an issue on macOS 10.15 when trying to create ZIP64 files using Archive Utility, the issue I'm reporting is not affected by this. As stated above, I'm on an Info-ZIP 6.0 toolchain, which does support extraction of proper ZIP64 files.

> If it won't extract at all, there is probably a transfer error.

If you provide me with a file hash, I'm happy to check. But I'm pretty sure I have the complete file. I also believe that the zip archive you're currently distributing is non-conformant with the ZIP64 specification and therefore extraction will fail for all extraction utilities that are strict about this, such as the default unzip program on many Unix platforms. Are you positive that your zip program is using ZIP64 extensions (which is required to create a compliant zip archive containing files of this size)?
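A file hash for such a comparison could be computed as in the minimal sketch below. It assumes SHA-256 is the agreed digest (the thread does not name one), and the function name `sha256_of_file` is hypothetical. The file is read in chunks so that a 9 GB archive does not have to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in fixed-size chunks to keep memory use bounded.
    SHA-256 is an assumption; any digest both sides agree on would do.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (not executed here):
#   print(sha256_of_file("AmsterdamUMCdb-v1.0.2.zip"))
```

If the digest matches the one computed on the distributor's side, a transfer error can be ruled out.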

> The workaround was to use Commander One or WinZip for Mac.

Unfortunately this does not work for my use-case. I'm trying to build a cross-platform pipeline for setting up the AUMC database. 7zip does extract the archive successfully (with warnings) but adding this as a dependency simply for extracting this one file seems unreasonable to me.

If you are planning on putting this off until a next release, do you have an eta on that?

patrickthoral commented 3 years ago

You are right that this is a non-conformance issue, but on the part of those other tools. What happens is that zipinfo uses the Central End Record, ZIP64 Central End Record and ZIP64 Central End Locator incorrectly (not based on version 2 of the ZIP64 specification).
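To see which of these three records an archive actually contains, one could scan the end of the file for their signatures. The sketch below is only a rough diagnostic, not a parser: it assumes all three records sit within the last 128 KiB of the file, and it does not validate the fields that follow each signature, so a matching byte sequence inside an archive comment would be reported as well. The name `find_end_records` is hypothetical.

```python
import os

# Four-byte signatures of the records found at the end of a ZIP archive.
SIGNATURES = {
    "end of central directory record": b"PK\x05\x06",
    "ZIP64 end of central directory locator": b"PK\x06\x07",
    "ZIP64 end of central directory record": b"PK\x06\x06",
}


def find_end_records(path, tail_size=128 * 1024):
    """Map each record name to the file offset of its last occurrence.

    Only the final `tail_size` bytes are searched (an assumption about
    where the records live); a record that is not found maps to None.
    """
    size = os.path.getsize(path)
    start = max(0, size - tail_size)
    with open(path, "rb") as fh:
        fh.seek(start)
        tail = fh.read()
    found = {}
    for name, signature in SIGNATURES.items():
        pos = tail.rfind(signature)
        found[name] = start + pos if pos >= 0 else None
    return found
```

An archive that needs ZIP64 (more than 4 GiB or more than 65535 entries) is expected to show all three records; a small archive normally shows only the first.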

The official PKWARE (the developers of the standard) tools work fine with this file created by the licensed Windows Compressed Folders (part of Windows). In addition, the Python ZipFile module can also display the directory listing fine, however it does not support Deflate64, so extracting is not possible.
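A directory listing of this kind, including the compression method of each entry, might look like the following sketch. The method numbers come from the ZIP format (9 is Deflate64), while `METHOD_NAMES`, `list_entries` and `unsupported_entries` are hypothetical names introduced for illustration.

```python
import zipfile

# Compression method IDs as stored in the ZIP directory.
# zipfile has no decoder for method 9 (Deflate64).
METHOD_NAMES = {0: "stored", 8: "deflate", 9: "deflate64", 12: "bzip2", 14: "lzma"}


def list_entries(archive):
    """Return (name, uncompressed size, method name) for every entry.

    `archive` may be a path or a seekable binary file object.
    Reading the directory does not decompress any data.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            (
                info.filename,
                info.file_size,
                METHOD_NAMES.get(info.compress_type, str(info.compress_type)),
            )
            for info in zf.infolist()
        ]


def unsupported_entries(archive):
    """Return the names of entries compressed with Deflate64."""
    return [name for name, _, method in list_entries(archive) if method == "deflate64"]
```

Entries reported by `unsupported_entries` can be listed but not extracted with the standard library alone.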

Indeed, a problem with the Zip standard is that its implementation is not open source at all, but proprietary PKWARE technology, and no official open source version exists. The only reason it exists today is that it has been in use for decades (since the MS-DOS era) and ended up in (licensed) technology (a de facto standard).

I will use the cross-platform ZipFile library for the next iteration (which also implies using the better-supported Deflate algorithm), but there's no ETA as of yet. I don't understand, though, why you think it is unreasonable to add that dependency. You are not allowed to distribute the files to other users anyway, so it's a one-time extraction.

nbenn commented 3 years ago

@patrickthoral Thanks for looking into making extraction easier cross-platform.

> You are not allowed to distribute the files anyway to other users

Obviously I'm not planning on distributing your data. I'm planning on distributing a pipeline in order to make obtaining results using your data (together with other datasets) more reproducible and (hopefully) easier to access. It is for such a pipeline where I'm trying to keep the number of dependencies as small as possible.

patrickthoral commented 3 years ago

The files have been rezipped with the Python ZipFile library, using the Deflate algorithm instead of Deflate64. I've verified that the archive works on Windows, macOS and Ubuntu with the built-in tools, so it should be safe to use in most environments. I'll notify you when the new file is available for download.
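For readers who want to build a comparable archive themselves, re-zipping a directory with Deflate and ZIP64 extensions enabled can be sketched as below. This is an illustration under assumptions, not the script used for the release: `zip_directory` and its arguments are hypothetical, and the archive is assumed to be written outside the source directory.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, archive_path):
    """Write every file under `source_dir` into a Deflate-compressed archive.

    allowZip64=True lets zipfile emit ZIP64 records when an entry or the
    archive exceeds the 4 GiB limits of the classic format.
    `archive_path` should not be located inside `source_dir`.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(
        archive_path, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    ) as zf:
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # Store paths relative to the source directory, with "/" separators.
                zf.write(file_path, arcname=file_path.relative_to(source_dir).as_posix())
```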

jsassenscheidt commented 3 years ago

I had a similar issue with unzip on Debian-Linux. A workaround is to repair the original zip file:

zip -FF AmsterdamUMCdb-v1.0.2.zip --out AmsterdamUMCdb-v1.0.2_repaired.zip -fz

Afterwards, extracting the new file with unzip works without errors:

unzip AmsterdamUMCdb-v1.0.2_repaired.zip

Best regards, Julian

patrickthoral commented 3 years ago

@jsassenscheidt @nbenn Indeed, most open source implementations have problems reading the directory (but interestingly not Python's zipfile library). The rezipped file has been uploaded to our transfer system, so I expect the file to be available for download for credentialed users in the next couple of days. Python's implementation sadly lacks a callback to report progress when (un)zipping, which is unfortunate when handling large files, so if anybody is interested, I added some sample code in the tools folder to improve this.
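One way to get progress reporting while extracting is to copy each member in chunks and call a user-supplied function after every chunk. The sketch below is an independent illustration and not the sample code from the tools folder; `extract_with_progress` and the `callback(done, total)` signature are assumptions. Empty directory entries are skipped.

```python
import zipfile
from pathlib import Path


def extract_with_progress(archive, dest_dir, callback, chunk_size=1024 * 1024):
    """Extract all files from `archive` into `dest_dir`, reporting progress.

    `callback(done, total)` is called after each chunk with the number of
    uncompressed bytes written so far and the total to be written.
    """
    root = Path(dest_dir).resolve()
    with zipfile.ZipFile(archive) as zf:
        members = [m for m in zf.infolist() if not m.is_dir()]
        total = sum(m.file_size for m in members)
        done = 0
        for member in members:
            target = (root / member.filename).resolve()
            # Refuse entries that would be written outside the destination.
            if root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.filename}")
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(member) as src, open(target, "wb") as dst:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    dst.write(chunk)
                    done += len(chunk)
                    callback(done, total)
    return total
```

A simple callback could print a percentage, for example `lambda done, total: print(f"{100 * done / total:.1f}%", end="\r")`.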

nbenn commented 3 years ago

@patrickthoral Thanks a lot for looking into this so swiftly. I'm happy to check it out. Just to clarify, did you bump the version number? If I have a file AmsterdamUMCdb-v1.0.2.zip for download, does that mean the new file has not propagated through yet?

patrickthoral commented 3 years ago

The version number stays the same (the data has not changed at all), but the new file should be available from DANS as of now.