Open janpio opened 5 years ago
What type of archive do you use?
For GZIP we have the following code. Each value is the start of the base64-encoded binary content of the GZIP archive (the header differs per platform):
const os = require('os');

// Expected base64 prefix of the gzip output on each platform.
const gzipHeader = {
  darwin: 'H4sIAAAAAAAAE2',
  win32: 'H4sIAAAAAAAACm',
  linux: 'H4sIAAAAAAAAA2',
}[os.platform()];
For ZIP we specify a fixed date:
zip.append(chain, {
  name: `customers.csv`,
  date: new Date('2000-07-18T20:18:24.441Z'),
}).finalize();
Note: the file name that is appended to the archive is also encoded, so it must be preserved in order to get exactly the same file content.
For additional information about the GZIP header, see https://www.forensicswiki.org/wiki/Gzip
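To make the platform difference concrete, here is a minimal sketch that decodes the three prefixes quoted above and prints the fields of the fixed 10-byte gzip header. The helper name describeGzipHeader is made up for illustration; the field layout follows the gzip format (magic, method, flags, mtime, extra flags, OS).

```javascript
// Prefixes copied from the snippet above. 14 base64 characters decode to
// 10 bytes, which is the size of the fixed part of a gzip header.
const prefixes = {
  darwin: 'H4sIAAAAAAAAE2',
  win32: 'H4sIAAAAAAAACm',
  linux: 'H4sIAAAAAAAAA2',
};

// Hypothetical helper: split the 10 header bytes into named fields.
function describeGzipHeader(base64Prefix) {
  const bytes = Buffer.from(base64Prefix, 'base64');
  return {
    magic: bytes.subarray(0, 2).toString('hex'), // 1f8b for every gzip stream
    method: bytes[2],                            // 8 = deflate
    flags: bytes[3],
    mtime: bytes.readUInt32LE(4),                // 0 = no timestamp stored
    extraFlags: bytes[8],
    os: bytes[9],                                // the byte that differs
  };
}

for (const [platform, prefix] of Object.entries(prefixes)) {
  console.log(platform, describeGzipHeader(prefix));
}
```

In these three prefixes the timestamp is already zero, and only the last byte (the OS field) changes: 19 for darwin, 10 for win32, 3 for linux. That is why the expected header has to be picked per platform.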
I was indeed using ZIP, so I will try gzip and see if this already fixes my problem. That would be super awesome. Will report back.
For zip, even when specifying the same date, the hash of the zip file differs (even though the contents are identical). Does anyone have a way of achieving deterministic zip archives?
Edit: I retract that. With the date specified, it does seem to work!
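For anyone who wants to check this on their own builds, a minimal sketch of the comparison using only Node's crypto module. The helper names sha256 and allIdentical are made up for illustration.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 digest of a Buffer (for example a finished archive).
function sha256(buffer) {
  return crypto.createHash('sha256').update(buffer).digest('hex');
}

// True when every buffer in the list hashes to the same digest.
function allIdentical(buffers) {
  return new Set(buffers.map(sha256)).size <= 1;
}
```

Usage would be something like allIdentical([fs.readFileSync('build1.zip'), fs.readFileSync('build2.zip')]), where the file names are placeholders for two archives built from the same input.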
I am finding that specifying the same date is a good start, but when using the .directory method I am still sometimes getting different archives. It seems that the order of files is sometimes different (verified by logging entry.name in the entry event), which is contributing to some nondeterminism.
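A minimal sketch of that check, assuming archive is an archiver instance (anything that emits entry events carrying a name property behaves the same way). The helper name recordEntryOrder is made up for illustration.

```javascript
// Collect entry names in the order the archive reports them.
// `archive` is assumed to be an archiver instance; the function only relies
// on it being an EventEmitter that emits 'entry' with an object that has `name`.
function recordEntryOrder(archive) {
  const names = [];
  archive.on('entry', (entry) => {
    names.push(entry.name);
  });
  return names;
}
```

Call it before adding files, then compare the returned array between two runs once the archive has been finalized; differing order means the archives will not be byte-identical.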
It appears that the directory method (and the glob method) uses the readdir-glob package.
After reading through the readdir-glob code for a bit, I see that there is some asynchronicity happening, but nothing that specifically guards against nondeterminism in the order of files (e.g. no sorting).
I suspect I can work around this by avoiding the directory and glob methods from archiver in my code and doing the globbing myself, but it would be nice if archiver handled this out of the box. At minimum, it might be a good idea to document this for these methods.
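A minimal sketch of doing the directory walk by hand, using only fs and path. The function name listFilesSorted is made up for illustration, and it only handles regular files (symlinks and other entry types are skipped).

```javascript
const fs = require('fs');
const path = require('path');

// Return every regular file under rootDir as a '/'-separated relative path,
// in a stable order that does not depend on what readdir returns.
function listFilesSorted(rootDir) {
  const results = [];
  const walk = (relDir) => {
    const entries = fs.readdirSync(path.join(rootDir, relDir), { withFileTypes: true });
    for (const entry of entries) {
      const rel = relDir ? `${relDir}/${entry.name}` : entry.name;
      if (entry.isDirectory()) walk(rel);
      else if (entry.isFile()) results.push(rel);
    }
  };
  walk('');
  // Plain code-unit comparison, so the order is independent of locale.
  return results.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}
```

The sorted list can then be fed to the archive one file at a time instead of calling directory or glob.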
After stabilizing the glob order by sorting, I was still getting nondeterministic zip files, and I could see that the order of the entries was still not stable. I read through some more of the code and I believe this is caused by the fs stat concurrency, which defaults to 4 and causes the stat queue to be consumed in a nondeterministic order. I can stabilize this by setting statConcurrency: 1 in the constructor options or by passing the stats option through in the entry data to bypass the stat queue completely. I assume this will come with a performance penalty, but my zips are now deterministic.
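Roughly, the combination described above could look like the sketch below. It assumes archive is an archiver instance and uses its file(filepath, data) method with the name, date and stats entry options; the function name appendDeterministically is made up, and the fixed date is simply the one quoted earlier in the thread.

```javascript
const fs = require('fs');
const path = require('path');

// Any constant works; this value is reused from the earlier comment.
const FIXED_DATE = new Date('2000-07-18T20:18:24.441Z');

// Append files in sorted order with a fixed date and pre-computed stats.
// `archive` is assumed to be an archiver instance; `relativePaths` are
// '/'-separated paths relative to rootDir.
function appendDeterministically(archive, rootDir, relativePaths) {
  const ordered = [...relativePaths].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  for (const name of ordered) {
    const absolute = path.join(rootDir, name);
    archive.file(absolute, {
      name,
      date: FIXED_DATE,
      // Supplying fs.Stats up front is what lets the entry skip the stat queue.
      stats: fs.statSync(absolute),
    });
  }
  return ordered;
}
```

The archive itself would be created as archiver('zip', { statConcurrency: 1 }), piped to an output stream, and finalized after this call. The synchronous stat calls are the performance penalty mentioned above.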
I am still getting nondeterministic zip files when using statConcurrency: 1 or passing in the stats option. The ordering of the files in the zip is not stable. I am using archiver.directory().
Is there a way to make archiver create identical archives each time it is run with the same files and options? Currently the resulting archive file size is identical, but the file itself is structured in a slightly different way, so calculating the checksum of several archives gives different results :/