Majored / rs-async-zip

An asynchronous ZIP archive reading/writing crate.
MIT License
123 stars 40 forks source link

Incorrect Zip64 Header Implementation #105

Closed Eric-Kobayashi closed 10 months ago

Eric-Kobayashi commented 10 months ago

Summary

The async_zip rust library appears to write an incorrect Zip64 header, which is causing problems when dealing with zip files larger than 4GB.

Expected Behaviour

The async_zip library should generate zip files with correct Zip64 headers, ensuring compatibility with a range of extraction tools, including older versions of 7zip.

Actual Behaviour

The library's incorrect Zip64 header implementation leads to extraction failure when using default version of 7zip (16.02) on ubuntu system.

Reproduction Steps

pub async fn zip_folder( files: Vec, input_path: &Path, target: &str, ) -> Result<()> let algorithm = Compression::Deflate; let mut writer = ZipFileWriter::new(File::create(Path::new(&target)).await?); let input_dir_str = input_path.as_os_str().to_str().ok_or(anyhow!("Input path not valid UTF-8."))?; let file_count = files.len();

for (i, entry) in files.iter().enumerate() { let entry_path = entry.as_path(); let entry_str = entry_path.as_os_str().to_str().ok_or(anyhow!("Directory file path not valid UTF-8."))?; let name = &entry_str[input_dir_str.len() + 1..]; let mut input_file = File::open(entry_path).await?; let mut buffer = Vec::new(); input_file.read_to_end(&mut buffer).await?; let entry_options = EntryOptions::new(name.into(), algorithm).unix_permissions(0o755); writer.write_entry_whole(entry_options, &buffer).await?; } writer.close().await?; Ok(()) }

let source_location = Path::new("test_folder"); let input_pathbuf = source_location.canonicalize().maperr(|| "Unable to canonicalise input path.")?; let files: Vec = ...// get all files in source_location, input files are 4GB+ in total size let target_location: &str = "test_folder.zip" zip_folder(files, input_pathbuf.as_path(), &target_location).await?

- Attempt to extract the resulting zip file using 7zip 16.02 (installed on ubuntu using `apt install p7zip-full`)
```sh
7z x <zip_file>

Scanning the drive for archives: 1 file, 10977219364 bytes (11 GiB)

Extracting archive: test_folder.zip ERROR: test_folder.zip Can not open the file as archive

Can't open as archive: 1 Files: 0 Size: 0 Compressed: 0


- Attempt to extract the resulting zip file using 7zip 22.01 (installed on alpine linux using `apk install 7zip`)
```sh
7z x <zip_file>

Scanning the drive for archives: 1 file, 10977219364 bytes (11 GiB)

Extracting archive: test_folder.zip

ERRORS: Headers Error


# Additional Information:
- The Mac system utility can successfully extract files despite the incorrect header.
- The issue seems limited to header problems and doesn't result in data loss.

# Attempts to Resolve:
- Initial test uses version 0.0.8, which leads to the below warning when unzipping with 7z ver22.01:

7-Zip (z) 22.01 (arm64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
64-bit arm_v:8 locale=C.UTF-8 Threads:6

Scanning the drive for archives: 1 file, 4665988600 bytes (4450 MiB) Extracting archive: test_folder.zip

WARNINGS: 32-bit overflow in headers


Upgrading `async_zip` to 0.0.15 leads to a change in behaviour where the header warning becomes a header error.

- Experimentation with different compression algorithms offered by the async_zip library doesn't alleviate the issue. 
    - The issue is fundamentally related to the Zip64 header within the zip file format, rather than the compression algorithm itself.
chirok11 commented 10 months ago

Today had the same issue, got an broken zip archive, and at least async_zip is unable to unarchive this archive. Good thing is I could unarchive it within 7z and restore it.

St4NNi commented 10 months ago

Hi, i had a similar issue and tracked it down to missing Zip64 EOCDR and EOCDL at the end of the file (The actual headers for data sections are correct).

The problem is that ZipFileWriter is not automatically setting is_zip64 when the cumulative file size is above 4GB.

As a workaround you can use: force_zip64 like so writer = writer.force_zip64(); and the archive should work just fine.

I've opened a PR (#106) to fix this issue.

Eric-Kobayashi commented 10 months ago

Hi @St4NNi thanks a lot for your suggestion. I've tried the writer = writer.force_zip64(); workaround but it results in a zip file that has some files with header error, when trying to unzip it:

Sub items Errors: 1613

These 1613 files were created but have no content as a result. This did not happen before implementing the workaround.

I've tried to pull your branch from this commit implemented in the PR https://github.com/Majored/rs-async-zip/pull/106 https://github.com/St4NNi/rs-async-zip/commit/1466a6c97c4c9407aa00e7247dba6f61cef0ad7d with the same outcome as the above workaround.

### Some additional information that may be helpful:
- My test folder is 7.6 GB and after zipping it reduces to 4.6 GB. It contains 18000+ files.
- I've also tried to zip the folder with Mac system utility and observe the below when running `7z l <zip file>` with 7zip 22.01:
zip file generated by Mac system utility:

7-Zip (z) 22.01 (arm64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15 64-bit arm_v:8 locale=C.UTF-8 Threads:6

Scanning the drive for archives: 1 file, 4652012039 bytes (4437 MiB)

Listing archive: mac-test.zip

-- Path = mac-test.zip Type = zip Physical Size = 4652012039 64-bit = + Characteristics = Zip64

zip file generated by async_zip after the workaround (or the fix in #106 )

7-Zip (z) 22.01 (arm64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15 64-bit arm_v:8 locale=C.UTF-8 Threads:6

Scanning the drive for archives: 1 file, 4665989723 bytes (4450 MiB)

Listing archive: mac_async_zip_zip64_fix.zip

-- Path = mac_async_zip_zip64_fix.zip Type = zip Physical Size = 4665989723 64-bit = + Characteristics = Zip64 Unsorted_CD


#### Observation: 
- Compared to the correctly zipped file, there is an additional `Unsorted_CD` flag in the async_zip generated zip. 
- Compared to the zip file before the workaround or the fix in #106, the `ERRORS: Headers Error` no longer occurs and the `64-bit = +` is displayed 
St4NNi commented 10 months ago

Hey, yes unfortunately there seems to be something else broken.

I could reproduce this on my side as well. Maybe to just give everyone else also some insights: The Unsorted_CD notification is unrelated, 7z just expects the records to be sorted alphabetically, so you just can .sort() your Vec<PathBuf> and this should be gone.

Minimal working example to easily reproduce this

  1. Create two files with all 0x00 bytes one that is exactly 0xFFFFFF / u32::MAX long -> truncate -s 4294967295 test and one that is any size above (e.g. 10 Bytes) truncate -s 10 test2
  2. Zip both files with Compression::Stored (to make sure that we can more easily inspect the headers and have a more reproducible file size)

    let mut file = File::create("test.zip").await?;
    let mut writer = ZipFileWriter::with_tokio(&mut file);
    
    writer = writer.force_zip64();
    
    for file in ["test", "test2"] {
        let builder = ZipEntryBuilder::new(file.into(), Compression::Stored);
        let mut data = Vec::new();
        File::open(file).await?.read_to_end(&mut data).await?;
        writer.write_entry_whole(builder, &data).await?;
    }
    writer.close().await?;
  3. Examine file integrity / try to decompress the file e.g. 7z t test.zip will result in ERROR: Headers Error : test2 (The first file is fine)

Sadly I may have not the time to investigate this further, but from my understanding this is broken because the lh_offset is beyond the u32 bounds. I tried to clamp this to NON_ZIP64_MAX_SIZE and add this information to relative_header_offset in Zip64ExtendedInformationExtraField but with no luck. I will give an update if I have figured something out.

St4NNi commented 10 months ago

Actually I think I've figured it out. Thanks to the above minimal working example and hexdump comparisons with 7z output as reference.

Problem

Zip64 is needed in one of the following cases:

The previous implementation checked only if A was the case and completely ignored B. The specification says that if the compressed and uncompressed sizes are below 4 Gib the regular fields in the file headers can (and should) be used [ref].

The actual problem was as expected the lh_offset in the central directory file header. This needed to be upper bounded to NON_ZIP64_MAX_SIZE but additionally a Zip64 extended extra field with relative_header_offset was needed (to specify the u64 offset for the actual beginning of the entry).

This is only required in the Central directory and does not need to be added in the local file header that comes in front of the raw data, "valid" fields from the regular format should be skipped completely (in order) in the Zip64 extra fields. Since both compressed and uncompressed sizes are valid no Zip64 extra field is necessary in the local file header. The extra field is only required in the CD file header.

Fix

https://github.com/Majored/rs-async-zip/pull/106/commits/6118ad7b613395c56f735e33a605dc54f12abd7e / https://github.com/Majored/rs-async-zip/pull/106/commits/3e080b736c7ca843bc7a0bbad5d7126df1d828d6 fix this issue by creating a Zip64ExtendedInformationExtraFieldBuilder that keeps track of the added options and creates a valid ZIP64 extra entry via its build method.

If only case B occurs, the extra information is purely added to the central directory file header. If A is the case (with or without B) the local file header is extended by a Zip64 extra entry. The builder also keeps track of the actual expected data_size and removes the need for the hard-coded 16 bytes (because a pure relative_header_offset is only 8 bytes and relative offset + sizes is 24 bytes (8 + 16) in addition to the 16 bytes for (uncompressed + compressed) sizes only.

Caveat: This is more or less just a PoC additional tests especially for the entry_stream are needed to verify that Zip64 handling now behaves correctly.

Eric-Kobayashi commented 10 months ago

Can confirm the commit https://github.com/Majored/rs-async-zip/commit/1466a6c97c4c9407aa00e7247dba6f61cef0ad7d fixes the issue during my test:

Scanning the drive for archives: 1 file, 4666009079 bytes (4450 MiB)

Extracting archive: mac_async_zip_zip64_fix.zip

Path = mac_async_zip_zip64_fix.zip Type = zip Physical Size = 4666009079 64-bit = +

Everything is Ok

Files: 18153 Size: 7659607926 Compressed: 4666009079

7-Zip (z) 22.01 (arm64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15 64-bit arm_v:8 locale=C.UTF-8 Threads:6

Scanning the drive for archives: 1 file, 4666009079 bytes (4450 MiB)

Extracting archive: mac_async_zip_zip64_fix.zip

Path = mac_async_zip_zip64_fix.zip Type = zip Physical Size = 4666009079 64-bit = + Characteristics = Zip64

Everything is Ok

Files: 18153 Size: 7659607926 Compressed: 4666009079


- The Mac system utility is also able to unzip the archive

Thanks @St4NNi for providing the fix and hope #106 will be reviewed and merged.
Majored commented 9 months ago

@Eric-Kobayashi Do you think you could retest the current main branch for me in relation to this? Just want to double check none of the conflicts from two later PRs have caused any issues.

Eric-Kobayashi commented 9 months ago

@Majored Can confirm the current main branch works as expected.