nbenn closed this issue 3 years ago
The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.
@patrickthoral Thanks for getting in touch. I don't believe the macOS Archive Utility has anything to do with this, mainly because the issue I'm describing shows up when running zipinfo from the command line.
Furthermore, I do not think the problem is limited to macOS. I can reproduce the issue under CentOS 7, for example, again running zipinfo v3.0.0:
[nbennett@eu-login-18 aumc]$ zipinfo AmsterdamUMCdb-v1.0.2.zip
Archive: AmsterdamUMCdb-v1.0.2.zip
Zip file size: 9143127113 bytes, number of entries: 7
warning [AmsterdamUMCdb-v1.0.2.zip]: 4848159318 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [AmsterdamUMCdb-v1.0.2.zip]: start of central directory not found;
zipfile corrupt.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
If you want me to, I can also check on Fedora, but I honestly do not believe this is an OS issue in that sense. Rather, I suspect there is an issue with how the zip file was created.
I do not think there is a problem with the OS itself, but with the common source most unzip utilities are based upon. I could reproduce the same error a colleague had on a Mac. The workaround was to use Commander One or WinZip for Mac. The file was created on a Windows system with the built-in archiving tools. For the next version, we'll check if there's a format/setting that won't trip up the default archiving tools on *nix-based systems. If it won't extract at all, there is probably a transfer error.
The most likely reason is an incompatibility with the Mac’s default extraction program, 'Archive Utility', which does not support files over 4 GB. Other extraction utilities (e.g. Commander One/WinZip for Mac) should be fine.
For an enlightening SO post on this, see https://stackoverflow.com/a/59518097/3855417.
While you're correct that there still is an issue on macOS 10.15 when trying to create ZIP64 files using Archive Utility, the issue I'm reporting is not affected by this. As stated above, I'm on an Info-ZIP 6.0 toolchain, which does support extraction of proper ZIP64 files.
If it won't extract at all, there is probably a transfer error.
If you provide me with a file hash, I'm happy to check. But I'm pretty sure I have the complete file. I also believe that the zip archive you're currently distributing is non-conformant with the ZIP64 specification and therefore extraction will fail for all extraction utilities that are strict about this, such as the default unzip program on many Unix platforms. Are you positive that your zip program is using ZIP64 extensions (which are required to create a compliant zip archive containing files of this size)?
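For anyone who wants to check this without unzip, here is a minimal Python sketch that looks for the ZIP64 end of central directory locator directly in front of the end of central directory record. The function name `has_zip64_locator` is made up for illustration, and the check only looks at record signatures and their positions; it does not validate the offsets stored inside the records, so it is a rough indicator rather than a conformance test.

```python
import io

EOCD_SIG = b"PK\x05\x06"           # end of central directory record
ZIP64_LOCATOR_SIG = b"PK\x06\x07"  # ZIP64 end of central directory locator
EOCD_MIN_SIZE = 22                 # fixed part of the EOCD record
MAX_COMMENT = 65535                # maximum length of the archive comment
LOCATOR_SIZE = 20                  # fixed size of the ZIP64 locator


def has_zip64_locator(fileobj):
    """Return True if a ZIP64 locator sits right before the last EOCD record.

    `fileobj` is a seekable binary file object (hypothetical helper,
    signature-based check only).
    """
    fileobj.seek(0, io.SEEK_END)
    size = fileobj.tell()
    # Only the tail of the file needs to be read, even for multi-GB archives.
    tail_len = min(size, EOCD_MIN_SIZE + MAX_COMMENT + LOCATOR_SIZE)
    fileobj.seek(size - tail_len)
    tail = fileobj.read(tail_len)

    eocd_pos = tail.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("no end of central directory record found")
    if eocd_pos < LOCATOR_SIZE:
        return False
    start = eocd_pos - LOCATOR_SIZE
    return tail[start:start + 4] == ZIP64_LOCATOR_SIG
```

Usage would be along the lines of `with open("AmsterdamUMCdb-v1.0.2.zip", "rb") as f: print(has_zip64_locator(f))`.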
The work around was to use Commander One or WinZip for mac.
Unfortunately this does not work for my use case. I'm trying to build a cross-platform pipeline for setting up the AUMC database. 7-Zip does extract the archive successfully (with warnings), but adding it as a dependency simply for extracting this one file seems unreasonable to me.
If you are planning on putting this off until a next release, do you have an eta on that?
You are right that this is a non-conformance issue, but on the part of those other tools. What happens is that zipinfo uses the Central End Record, ZIP64 Central End Record and ZIP64 Central End Locator incorrectly (not based on version 2 of the ZIP64 specification).
The official PKWARE (the developers of the standard) tools work fine with this file created by the licensed Windows Compressed Folders (part of Windows). In addition, the Python ZipFile module can also display the directory listing fine, however it does not support Deflate64, so extracting is not possible.
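To illustrate the listing-versus-extraction point, here is a small sketch using Python's `zipfile` module. It reads the directory and flags entries stored with Deflate64, which is compression method 9 in the ZIP specification. The constant `DEFLATE64` and the two function names are my own for illustration; `zipfile` itself defines no constant for this method.

```python
import zipfile

# Compression method 9 is Deflate64 in the ZIP specification.
# zipfile has no named constant for it, so this one is defined here.
DEFLATE64 = 9


def list_entries(archive):
    """Return (name, uncompressed size, compression method) per entry.

    `archive` is a path or a seekable binary file object. Only the central
    directory is read, so this also works for methods zipfile cannot unpack.
    """
    with zipfile.ZipFile(archive) as zf:
        return [(i.filename, i.file_size, i.compress_type) for i in zf.infolist()]


def deflate64_entries(archive):
    """Return the names of entries that use Deflate64."""
    return [name for name, _, method in list_entries(archive) if method == DEFLATE64]
```

If `deflate64_entries` returns anything, reading those members with `zipfile` can be expected to fail, while the directory listing itself still works.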
Indeed, a problem with the Zip standard is that its implementation is not open source at all, but PKWARE proprietary technology, and no official open source version exists. The only reason it exists today is because it has been in use for decades (since the MS-DOS era) and ended up in (licensed) technology (a de facto standard).
I will use the cross-platform ZipFile library for the next iteration (that will also imply using the better supported Deflate algorithm), but there's no ETA as of yet. I don't understand, though, why adding that dependency sounds unreasonable. You are not allowed to distribute the files anyway to other users, so it's a one-time extraction.
@patrickthoral Thanks for looking into making extraction easier cross-platform.
You are not allowed to distribute the files anyway to other users
Obviously I'm not planning on distributing your data. I'm planning on distributing a pipeline in order to make obtaining results using your data (together with other datasets) more reproducible and (hopefully) easier to access. It is for such a pipeline where I'm trying to keep the number of dependencies as small as possible.
The files have been rezipped with the Python ZipFile library, using the Deflate algorithm instead of the Deflate64 algorithm. I've verified that the archive works on Windows, macOS and Ubuntu with the built-in tools, so it should be safe to use in most environments. I'll notify you when the new file is available for download.
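For readers who want to do something similar with their own data, a minimal sketch of writing a Deflate-compressed archive with Python's `zipfile` could look like the following. This is not the script that was used for the release; the function name `zip_directory` and its arguments are hypothetical, and it assumes the output file lives outside the directory being zipped.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, dest_zip):
    """Write every file below src_dir into dest_zip using plain Deflate.

    Hypothetical helper: dest_zip must not be located inside src_dir.
    """
    src_dir = Path(src_dir)
    with zipfile.ZipFile(
        dest_zip,
        "w",
        compression=zipfile.ZIP_DEFLATED,  # method 8, widely supported
        allowZip64=True,                   # needed for members over 4 GiB
    ) as archive:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir with forward slashes.
                archive.write(path, arcname=path.relative_to(src_dir).as_posix())
```

`allowZip64=True` is already the default in Python 3; it is spelled out here only because the file sizes in this thread depend on it.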
I had a similar issue with unzip on Debian-Linux. A workaround is to repair the original zip file:
zip -FF AmsterdamUMCdb-v1.0.2.zip --out AmsterdamUMCdb-v1.0.2_repaired.zip -fz
Afterwards, extracting the new file with unzip works without errors:
unzip AmsterdamUMCdb-v1.0.2_repaired.zip
Best regards, Julian
@jsassenscheidt @nbenn Indeed, most open source implementations have problems reading the directory (but interestingly not Python's zipfile library). The rezipped file has been uploaded to our transfer system, so I expect the file to be available for download for credentialed users in the next couple of days. Python's implementation sadly lacks a callback to report progress when (un)zipping, which is unfortunate when handling large files, so if anybody is interested, I added some sample code in the tools folder to improve this.
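As a rough idea of how progress reporting can be layered on top of `zipfile`, here is a sketch that copies each member in chunks and calls a user-supplied function after every chunk. It is not the sample code from the tools folder; `extract_with_progress`, `callback` and `chunk_size` are names chosen for illustration, and it requires Python 3.9 or later.

```python
import zipfile
from pathlib import Path


def extract_with_progress(archive, dest_dir, callback, chunk_size=1024 * 1024):
    """Extract all members, calling callback(bytes_done, bytes_total) per chunk.

    Hypothetical helper: `archive` is a path or a seekable binary file object.
    """
    dest_dir = Path(dest_dir).resolve()
    with zipfile.ZipFile(archive) as zf:
        members = zf.infolist()
        total = sum(info.file_size for info in members)
        done = 0
        for info in members:
            target = (dest_dir / info.filename).resolve()
            # Refuse member names that would escape the destination directory.
            if not target.is_relative_to(dest_dir):
                raise ValueError(f"unsafe member path: {info.filename}")
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                while chunk := src.read(chunk_size):
                    dst.write(chunk)
                    done += len(chunk)
                    callback(done, total)
```

A simple callback would be `lambda done, total: print(f"{done / total:.1%}", end="\r")`. Note that empty members never trigger the callback, since no chunk is read for them.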
@patrickthoral Thanks a lot for looking into this so swiftly. I'm happy to check it out. Just to clarify, did you bump the version number? If I have a file AmsterdamUMCdb-v1.0.2.zip for download, does that mean the new file has not propagated through?
The version number stays the same (the data has not changed at all), but the new file should be available from DANS as of now.
Is it just me or are other people also having problems extracting the zip archive distributed via the filesender instance at https://filesender.surf.nl? On macOS 10.15.6, using zipinfo v3.00 I get
I'm sorry if I'm reporting this issue in the wrong place and I'm happy to be redirected.