drivendataorg / repro-zipfile

A tiny, zero-dependency replacement for Python's zipfile.ZipFile for creating reproducible/deterministic ZIP archives.

Binary zip not equal due to differences in `external_attr` (with `write()`) #3

Closed fritz-hh closed 6 months ago

fritz-hh commented 7 months ago

Used version: v0.1.0
Python version: 3.8.x (constrained by my working environment)

Description: I am using repro-zipfile in a continuous integration pipeline (Jenkins) with several nodes. I am experiencing an issue where the resulting zip file sometimes differs due to changes in the External file attributes field in the Central directory file header (see https://en.wikipedia.org/wiki/ZIP_(file_format)#Central_directory_file_header).
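To check whether two archives differ only in that field, the stored attributes can be read back with the standard library. The following is a minimal sketch, not code from repro-zipfile: the helper names `build` and `unix_modes` and the member name `hello.txt` are made up for illustration. It relies on the convention that, for archives created on Unix-like systems, the upper 16 bits of `external_attr` hold the file's `st_mode`.

```python
import io
import stat
import zipfile


def build(mode: int) -> bytes:
    """Build a one-file archive whose member carries the given permission bits."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp so that only the permissions vary between builds.
        info = zipfile.ZipInfo("hello.txt", date_time=(1980, 1, 1, 0, 0, 0))
        info.external_attr = (stat.S_IFREG | mode) << 16
        zf.writestr(info, b"hello")
    return buf.getvalue()


def unix_modes(zip_bytes: bytes) -> dict:
    """Map each member name to a permission string such as '-rw-r--r--'."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: stat.filemode(info.external_attr >> 16)
            for info in zf.infolist()
        }


a = build(0o644)
b = build(0o600)
print(a == b)            # same content, different permissions
print(unix_modes(a))
print(unix_modes(b))
```

With identical file content and timestamps, the two archives should still compare unequal, and the decoded permission strings show where they differ.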

I am currently not sure what exactly causes the issue. (It could be a git configuration issue on one of the Jenkins nodes that leads to the file having different permissions on different nodes.) However, I would propose a modification to the write() method so that it no longer considers the original file permissions and instead sets the permissions by default to rwx for owner, group, and others. That way, a change in file permissions would not lead to a different zip binary.

This change would actually be in line with the goal that you stated in your README:

"Reproducible" or "deterministic" in this context means that the binary content of the ZIP archive is identical if you add files with identical binary content in the same order

What do you think about this proposal?

fritz-hh commented 7 months ago

I finally decided to use ZipFile.writestr() from the standard library. Since it is possible to pass a ZipInfo object as an argument to the method, it gives me all the control I need to ensure that the resulting zip file always remains binary equal.
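For anyone landing here with the same problem, a workaround along these lines might look roughly like the sketch below. It is not the exact code used in the pipeline above: the function name `deterministic_zip`, the fixed date, and the choice of 0o644 are assumptions made for illustration. `create_system` is pinned as well because `ZipInfo` picks its default based on the platform.

```python
import io
import stat
import zipfile

# Assumption: any fixed timestamp works; 1980-01-01 is the earliest ZIP can store.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def deterministic_zip(files: dict) -> bytes:
    """Build an archive from {arcname: bytes} with all metadata pinned."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Sort the names so the insertion order of the dict does not matter.
        for arcname in sorted(files):
            info = zipfile.ZipInfo(arcname, date_time=FIXED_DATE)
            info.create_system = 3  # 3 = Unix, regardless of the build host
            info.external_attr = (stat.S_IFREG | 0o644) << 16
            # compress_type is left at the ZipInfo default (ZIP_STORED).
            zf.writestr(info, files[arcname])
    return buf.getvalue()


first = deterministic_zip({"b.txt": b"2", "a.txt": b"1"})
second = deterministic_zip({"a.txt": b"1", "b.txt": b"2"})
print(first == second)
```

Because `writestr()` takes the bytes directly, nothing is read from the filesystem metadata, so permissions and modification times on the build node cannot leak into the archive.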

jayqi commented 6 months ago

Sorry for the late response to this. I had encountered something similar with file permissions, and in my case it was due to the input files having different permissions because of umask, which is what determines the permissions of newly created files.

https://www.cyberciti.biz/tips/understanding-linux-unix-umask-value-usage.html

In particular, if you are a root user vs. a non-root user with default umask settings on Unix systems, files get created with different permissions.
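A small sketch of that effect on a Unix-like system, using only the standard library. The helper name `mode_under_umask` is made up for illustration, and 0o022 and 0o002 are used as two commonly seen default umask values; the actual defaults depend on the distribution and user setup.

```python
import os
import stat
import tempfile


def mode_under_umask(mask: int) -> str:
    """Create a file under the given umask and return its permission string."""
    old = os.umask(mask)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "f.txt")
            # Request rw for everyone; the OS clears the bits set in the umask.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return stat.filemode(os.stat(path).st_mode)
    finally:
        os.umask(old)  # always restore the original mask


print(mode_under_umask(0o022))
print(mode_under_umask(0o002))
```

On a typical Linux filesystem the two calls should report different group permissions for an otherwise identical file, and that difference is what ends up in `external_attr` when the file is added with `write()`.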

I agree that forcing fixed file permissions is the right way to address this, and will work on making that change.

jayqi commented 6 months ago

Hi @fritz-hh,

v0.2.0 has been released, which sets the permissions to fixed values. I used more restrictive default values than what you suggested: 0o644 (rw-r--r--) for files and 0o755 (rwxr-xr-x) for directories. rwxrwxrwx felt too aggressive to me. If you want to use different values, you can set the REPRO_ZIPFILE_FILE_MODE and/or REPRO_ZIPFILE_DIR_MODE environment variables.
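To illustrate what fixed modes with an environment override amount to at the ZIP level, here is a rough sketch. It is not the library's implementation: the helper names `mode_from_env` and `external_attr_for` are hypothetical, and parsing the variables as octal strings such as "644" is an assumption that this thread does not confirm. Only the variable names and the two default values come from the comment above.

```python
import os
import stat


def mode_from_env(var: str, default: int) -> int:
    """Return permission bits from the environment, or the default if unset."""
    raw = os.environ.get(var)
    # Assumption: the value is an octal string such as "644" or "0o644".
    return int(raw, 8) if raw else default


def external_attr_for(arcname: str) -> int:
    """Compute a ZIP external_attr value from fixed modes, ignoring the filesystem."""
    if arcname.endswith("/"):
        mode = mode_from_env("REPRO_ZIPFILE_DIR_MODE", 0o755)
        # Directory entries also carry the MS-DOS directory flag in the low byte.
        return ((stat.S_IFDIR | mode) << 16) | 0x10
    mode = mode_from_env("REPRO_ZIPFILE_FILE_MODE", 0o644)
    return (stat.S_IFREG | mode) << 16


print(stat.filemode(external_attr_for("data.csv") >> 16))
print(stat.filemode(external_attr_for("subdir/") >> 16))
```

The point of the design is that the stored attributes depend only on the entry type and, optionally, on explicit configuration, never on the permissions the input files happen to have on a given machine.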

fritz-hh commented 6 months ago

Hi @jayqi. Thanks a lot for the feedback on my issue and for your fix. The permissions you chose are totally fine for me. Regards from Hamburg, Germany.