coreos / gzran

gzip indexer for random access into compressed files
Apache License 2.0

serialize compression state #2

Open philips opened 9 years ago

philips commented 9 years ago

It would be great to be able to serialize out the compression state/headers so that someone could take this serialized state and an uncompressed item and reproduce an identical asset.

For example, in the case of rocket we want to be able to extract a tar.gz and put it on disk. Then at some later date we want to take those files on disk and exactly recreate the tar.gz so we can validate a signature against it.

vbatts commented 9 years ago

agreed.

I'll have to research more on this. While looking at golang's compress/gzip, the IEEE table they use and the gnu gzip crc32 tables include some of the same polynomials, but the two produce different output (at the same compression levels). This would be great to achieve.

peebs commented 9 years ago

@vbatts So, compress/gzip produces different compressed output than gnu gzip at the same compression level? Do you know if this is a result of different gzip headers, or rather that the deflate functions in the two libraries actually produce different results? If it is the latter, that means either one of the implementations deviates from rfc1951 or the rfc doesn't guarantee reproducibility.

As far as serialization goes in gzran, the main thing to look at is restoring the step field of the point struct. Not understanding the use of this field in the decompressor is what tripped me up for a while during the initial implementation. Otherwise, you could pass the Index (with crc and gzip header) straight to something like Gob.
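A minimal sketch of that Gob idea, under stated assumptions: `pointWire` and `indexWire` are hypothetical stand-ins, not gzran's actual types, and their field names are invented for illustration. The one fact it relies on is that encoding/gob only transmits exported struct fields, so a lowercase field such as `step` would be silently dropped unless it is copied into an exported mirror like this one.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/gob"
	"fmt"
)

// pointWire is a hypothetical, gob-friendly mirror of an index point.
// Every field is exported, because gob skips unexported fields.
type pointWire struct {
	CompressedOffset   int64
	UncompressedOffset int64
	Step               int // the value that must survive serialization
}

// indexWire is a hypothetical mirror of an index plus crc and gzip header.
type indexWire struct {
	Points []pointWire
	CRC    uint32
	Header gzip.Header
}

// encodeIndex serializes the index with gob.
func encodeIndex(idx indexWire) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(idx); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeIndex restores an index previously written by encodeIndex.
func decodeIndex(b []byte) (indexWire, error) {
	var idx indexWire
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&idx)
	return idx, err
}

// stepSurvives round-trips a sample index and reports whether the
// step, crc and header name come back unchanged.
func stepSurvives() (bool, error) {
	in := indexWire{
		Points: []pointWire{{CompressedOffset: 10, UncompressedOffset: 0, Step: 3}},
		CRC:    0xdeadbeef,
		Header: gzip.Header{Name: "rootfs.tar"},
	}
	b, err := encodeIndex(in)
	if err != nil {
		return false, err
	}
	out, err := decodeIndex(b)
	if err != nil {
		return false, err
	}
	ok := len(out.Points) == 1 &&
		out.Points[0].Step == 3 &&
		out.CRC == in.CRC &&
		out.Header.Name == in.Header.Name
	return ok, nil
}

func main() {
	ok, err := stepSurvives()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("step survived round trip:", ok)
}
```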

For the purpose of reproducing a bit-for-bit identical gzip file from the uncompressed data, saving the index isn't strictly necessary, but might be useful for other reasons. For reproducibility you should only need:

- to save and restore the gzip headers
- to ensure that the output of whatever deflate library first deflates the ACI can be reproduced by Go's compress/flate

The last point is the one that needs a little research.
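The first point can be sketched with compress/gzip alone. This is only an illustration under assumptions: the helper names (`compress`, `extract`, `rebuildMatches`) and the sample header values are made up, and the compression level is passed separately because it is not a field of `gzip.Header`. It shows Go reproducing its own output once the header is restored; it says nothing about matching an archive written by gnu gzip.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"time"
)

// compress gzips data with the given header fields and level.
func compress(data []byte, hdr gzip.Header, level int) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return nil, err
	}
	zw.Header = hdr // must be set before the first Write
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// extract returns the gzip header and the uncompressed payload.
func extract(archive []byte) (gzip.Header, []byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return gzip.Header{}, nil, err
	}
	defer zr.Close()
	payload, err := io.ReadAll(zr)
	if err != nil {
		return gzip.Header{}, nil, err
	}
	return zr.Header, payload, nil
}

// rebuildMatches compresses data, extracts it, recompresses the payload
// with the saved header, and reports whether the bytes are identical.
func rebuildMatches(data []byte) (bool, error) {
	orig, err := compress(data, gzip.Header{
		Name:    "rootfs.tar",             // sample value
		ModTime: time.Unix(1420070400, 0), // sample value
		OS:      3,                        // 3 means Unix in RFC 1952
	}, gzip.BestCompression)
	if err != nil {
		return false, err
	}
	hdr, payload, err := extract(orig)
	if err != nil {
		return false, err
	}
	rebuilt, err := compress(payload, hdr, gzip.BestCompression)
	if err != nil {
		return false, err
	}
	return bytes.Equal(orig, rebuilt), nil
}

func main() {
	ok, err := rebuildMatches([]byte("example payload, example payload"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("identical:", ok)
}
```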

vbatts commented 9 years ago

so, i'll do some rfc reading tomorrow. here was my initial investigation https://gist.github.com/vbatts/43fc209acf37ff21dd87

vbatts commented 9 years ago

Also, RFC 1951 is for deflate. Very much similar to gzip. RFC 1952 is for gzip, and is what golang's implementation follows.

peebs commented 9 years ago

Ah, this may be a problem. Gzip, in Go, is a header and checksum wrapped around DEFLATE (http://golang.org/pkg/compress/flate/) which is also used by zlib.
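To make the "header and checksum wrapped around DEFLATE" layering concrete, here is a rough sketch that builds a gzip member by hand from compress/flate output and then checks that compress/gzip accepts it. The function names are invented for illustration; the byte layout (10-byte header, raw deflate stream, little-endian CRC-32 and length trailer) follows RFC 1952 with all optional fields left out.

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"io"
)

// wrapDeflate builds a minimal gzip member around a raw DEFLATE stream.
func wrapDeflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	// Header: magic 0x1f 0x8b, CM=8 (deflate), FLG=0, MTIME=0 (4 bytes),
	// XFL=0, OS=255 (unknown).
	buf.Write([]byte{0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 255})

	// Body: raw DEFLATE data as produced by compress/flate.
	fw, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write(data); err != nil {
		return nil, err
	}
	if err := fw.Close(); err != nil {
		return nil, err
	}

	// Trailer: CRC-32 (IEEE) of the uncompressed data, then its length
	// modulo 2^32, both little-endian.
	var trailer [8]byte
	binary.LittleEndian.PutUint32(trailer[0:4], crc32.ChecksumIEEE(data))
	binary.LittleEndian.PutUint32(trailer[4:8], uint32(len(data)))
	buf.Write(trailer[:])
	return buf.Bytes(), nil
}

// unwrap decodes a gzip member with the standard library reader, which
// verifies the CRC-32 and length in the trailer.
func unwrap(member []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(member))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// roundTrip wraps s by hand and reads it back through compress/gzip.
func roundTrip(s string) (string, error) {
	member, err := wrapDeflate([]byte(s))
	if err != nil {
		return "", err
	}
	out, err := unwrap(member)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := roundTrip("hello, deflate")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Note that the CRC-32 in the trailer is computed over the uncompressed data, so it has no effect on the compressed bytes in the body.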

I assumed gnu gzip used DEFLATE as well, but the gnu gzip uses LZ77. The two packages don't seem compatible for reproducibility. DEFLATE is based on LZ77 but not the same.

Either we need LZ77 in Go or the initial ACI must be compressed with something implementing DEFLATE.

vbatts commented 9 years ago

then we should review http://golang.org/pkg/compress/lzw/ as well

peebs commented 9 years ago

Didn't even see that! Though, it appears LZW is not the same as LZ77, which is not the same as LZMA, LZSS, LZ78, etc.

vbatts commented 9 years ago

correct.

peebs commented 9 years ago

I'm confused about what compression method gzip uses. Here it seems to use DEFLATE: http://www.gzip.org/algorithm.txt

vbatts commented 9 years ago

also could be on the review-radar https://github.com/pierrec/lz4 (spec http://fastcompression.blogspot.fr/2013/04/lz4-streaming-format-final.html)