archiverjs / node-archiver

a streaming interface for archive generation
https://www.archiverjs.com
MIT License

Deterministic archive? #383

Open janpio opened 5 years ago

janpio commented 5 years ago

Is there a way to make archiver create identical archives each time it is run with the same files and options?

Currently the resulting archives have identical file sizes, but each file is structured slightly differently, so the checksums of several archives differ :/

avoinkov commented 5 years ago

What type of archive do you use?

For GZIP we have this code; each value is the base64-encoded start of the GZIP archive's binary content (its header) on the given platform:

const os = require('os'); // needed for os.platform() below

const gzipHeader = {
    darwin: 'H4sIAAAAAAAAE2',
    win32: 'H4sIAAAAAAAACm',
    linux: 'H4sIAAAAAAAAA2',
}[os.platform()];

For ZIP we specify date:

zip.append(chain, {
    name: `customers.csv`,
    date: new Date('2000-07-18T20:18:24.441Z'),
}).finalize();

Note: the name of the file appended to the archive is also encoded in it, so the name must stay the same in order to get exactly the same file content.

avoinkov commented 5 years ago

For additional information about GZIP header see https://www.forensicswiki.org/wiki/Gzip
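To make the platform-specific header values above easier to interpret, here is a minimal sketch that reads the fixed 10-byte gzip header using Node's built-in `zlib`. The field layout follows RFC 1952. The helper names `parseGzipHeader` and `normalizeGzipHeader` are hypothetical and not part of archiver, and the normalization step is only one possible way to get byte-identical output across platforms.

```javascript
const zlib = require('zlib');

// Fixed 10-byte gzip header (RFC 1952):
// ID1 ID2 CM FLG MTIME(4 bytes, little-endian) XFL OS
function parseGzipHeader(buf) {
  return {
    magic: buf.subarray(0, 2).toString('hex'), // '1f8b' identifies gzip
    method: buf[2],                            // 8 = deflate
    flags: buf[3],
    mtime: buf.readUInt32LE(4),                // seconds since epoch, 0 = not set
    extraFlags: buf[8],
    os: buf[9],                                // depends on the platform that wrote the file
  };
}

// Hypothetical helper: returns a copy with MTIME zeroed and OS set to 255 ("unknown").
// Assumes the FHCRC flag is not set, because a header CRC would no longer match.
function normalizeGzipHeader(buf) {
  const out = Buffer.from(buf); // Buffer.from(buffer) copies the bytes
  out.writeUInt32LE(0, 4);
  out[9] = 255;
  return out;
}

const gz = zlib.gzipSync(Buffer.from('id,name\n1,Ada\n'));
console.log(parseGzipHeader(gz)); // the `os` field varies between platforms

const normalized = normalizeGzipHeader(gz);
console.log(parseGzipHeader(normalized));
```

The compressed payload and trailer are left untouched, so the normalized buffer should still decompress to the same data.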

janpio commented 5 years ago

I was indeed using ZIP, so I will try gzip and see if this already fixes my problem. That would be super awesome. Will report back.

andreieftimie commented 3 years ago

For zip even with specifying the same date, the hash of the zip file differs (even though the contents are identical). Does anyone have a way of achieving deterministic zip archives?

Edit: I retract that. With the date specified, it does seem to work!

lencioni commented 1 month ago

I am finding that specifying the same date is a good start, but when using the .directory method I am still sometimes getting different archives. It seems that the order of files is sometimes different (verified by logging entry.name in the entry event), which is contributing to some nondeterminism.

It appears that the directory method (and the glob method) uses the readdir-glob package:

https://github.com/archiverjs/node-archiver/blob/0830dea0b3798d14d33b454005628958f4611586/lib/core.js#L679

https://github.com/archiverjs/node-archiver/blob/0830dea0b3798d14d33b454005628958f4611586/lib/core.js#L9

After reading through the readdir-glob code for a bit, I see that there is some asynchronicity happening but nothing that specifically guards against nondeterminism in the order of files (e.g. no sorting).

I suspect I can work around this by avoiding archiver's directory and glob methods in my code and doing the globbing myself, but it would be nice if archiver handled this out of the box. At minimum, it might be a good idea to document this for these methods.
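A minimal sketch of that workaround, assuming a plain recursive directory walk is enough (no glob patterns or symlink handling). `listFilesSorted` and `appendSorted` are hypothetical helper names; the only archiver call relied on is `archive.file(filepath, data)` with `name` and `date` in the entry data. This only fixes the order in which entries are handed to archiver, not anything archiver does internally afterwards.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect regular files under rootDir as '/'-separated
// relative paths, returned in sorted order.
function listFilesSorted(rootDir) {
  const results = [];
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        results.push(path.relative(rootDir, full).split(path.sep).join('/'));
      }
    }
  };
  walk(rootDir);
  // Default sort compares UTF-16 code units, so it does not depend on locale.
  return results.sort();
}

// Append every file in sorted order with one fixed date.
// `archive` is any object with an archiver-style file(filepath, data) method.
function appendSorted(archive, rootDir, fixedDate) {
  for (const name of listFilesSorted(rootDir)) {
    archive.file(path.join(rootDir, name), { name, date: fixedDate });
  }
}

// Usage with archiver (assumes the `archiver` package is installed):
//   const archiver = require('archiver');
//   const archive = archiver('zip');
//   archive.pipe(fs.createWriteStream('out.zip'));
//   appendSorted(archive, './data', new Date('2000-07-18T20:18:24.441Z'));
//   archive.finalize();
```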

lencioni commented 1 month ago

After stabilizing the glob order by sorting, I still was getting nondeterministic zip files and I could see that the order of the entries was still not stable. I read through some more of the code and I believe this is caused by the fs stat concurrency, which defaults to 4 and causes the stat queue to be consumed in a nondeterministic order. I can stabilize this by setting statConcurrency: 1 in the constructor options or by passing the stat option through in the data to bypass the stat queue completely. I assume this will come with a performance penalty, but my zips are now deterministic.
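Both options can be sketched roughly as follows. This assumes the entry-data property is named `stats` and expects an `fs.Stats` object, which is my reading of archiver's entry data and worth checking against the version in use. `entryDataWithStats` is a hypothetical helper name, and the fixed date is the one used earlier in this thread.

```javascript
const fs = require('fs');

// Option 1: constructor options that limit archiver to one stat call at a time.
const archiveOptions = { statConcurrency: 1 };

// Option 2: stat the file up front and pass the result in the entry data,
// so the entry does not need to go through the stat queue (assumed behavior).
function entryDataWithStats(filePath, name, fixedDate) {
  return {
    name,
    date: fixedDate,                // fixed timestamp instead of the file's mtime
    stats: fs.statSync(filePath),   // assumed property name: `stats` (fs.Stats)
  };
}

// Usage with archiver (assumes the `archiver` package is installed):
//   const archiver = require('archiver');
//   const FIXED_DATE = new Date('2000-07-18T20:18:24.441Z');
//   const archive = archiver('zip', archiveOptions);
//   archive.pipe(fs.createWriteStream('out.zip'));
//   archive.file('data/customers.csv',
//     entryDataWithStats('data/customers.csv', 'customers.csv', FIXED_DATE));
//   archive.finalize();
```

The stat result also carries the file mode, so archives built on machines with different file permissions may still differ unless the mode is pinned as well.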