JuliaIO / Tar.jl

TAR files: create, list, extract them in pure Julia
MIT License
79 stars 19 forks source link

[WIP] Tree hash computation without extraction. #35

Closed GunnarFarneback closed 4 years ago

GunnarFarneback commented 4 years ago

Implements #19. Hashing works but testing and top level interface is missing.

StefanKarpinski commented 4 years ago

Cool! My first very casual reaction is that it shares more code with extract than I’m happy with so I might want to figure out a common abstraction that both can be implemented in terms of. And for the gen_static.jl script in PkgServer.jl I just needed something that got access to each file as an I/O handle instead of as an extracted file, so I think there’s a commonality there: get each header with an I/O handle you can read from for each file—to extract, write it to a file; to hash, call a hash function on it. The only question is how to handle non-files, i.e. symlinks and directories.

GunnarFarneback commented 4 years ago

Yeah, I intentionally started it as a copy of extract to see how much needed to be replaced.

My first thought was to replace the predicate function of extract with an action function. Feeding it with header + io should be good, but you would probably want to wrap the io in a view that can only see the current part of the file and yields eof when you reach datalen. In that case non-files would get an io object that effectively works like an empty file.

GunnarFarneback commented 4 years ago

API function and a test added. Questions:

StefanKarpinski commented 4 years ago

I spent some time refactoring the tarball reading logic so that the same core can support both extract_tarball and hash_tree, which I implemented here: https://github.com/JuliaIO/Tar.jl/pull/36. Sorry for sniping your PR! I also added a lot of tests and wrote the docs, so it's pretty complete at this point 😬

StefanKarpinski commented 4 years ago

I added support for the classic git-sha1 algorithm and the upcoming git-sha256 algorithm, but I made the algorithm selection argument a string to eliminate the external coupling with the SHA package. That way support for non-git tree hashing algorithms could be added at some point.

StefanKarpinski commented 4 years ago

Thank you for spurring me to implement this. Sorry I didn't use your version 😢

GunnarFarneback commented 4 years ago

No problem. I'm just trying to understand why your SHA-256 doesn't give the same result as mine for non-empty tarballs.

StefanKarpinski commented 4 years ago

Hmm. That's troubling. Do you have an example?

GunnarFarneback commented 4 years ago

I'm looking at a tar archive containing a single file named "foo" with content "foo".

StefanKarpinski commented 4 years ago

I was looking for a place where one of us forgets to pass an optional HashType argument, but I don't see that in either of our code (I made the HashType non-optional in all internal methods). I wonder if we can figure out how to use git as a reference here—put it in SHA256 mode.

GunnarFarneback commented 4 years ago

Could it be that this part of the GitTools implementation:

    for (n, h, m) in entries
        content_size += ndigits(UInt32(m); base=8) + 1 + sizeof(n) + 1 + 20
    end

is incorrect for other hash types than the default? Specifically, should "20" rather be a function of the length of the hash?

GunnarFarneback commented 4 years ago

That's definitely what differs. Changing 20 to 32 for SHA256 gives matching results. Would certainly be good to verify with git, but this looks unlikely not to be a GitTools bug.

StefanKarpinski commented 4 years ago

Ah yes! That would do it. That 20 is 20 bytes == 160 bits of SHA1 hash. My version doesn't have this issue because instead of computing the length like that, I just write the data to a buffer and take its length, which is much more foolproof. Doesn't really matter for performance either since the size of the buffer for directory info is never going to be large enough to matter. For file data, I optimized the hashing in https://github.com/JuliaIO/Tar.jl/pull/37, so that it does the hashing incrementally instead of writing everything to a buffer just to measure the length.

StefanKarpinski commented 4 years ago

I'm working on a separate package that factors out all the content tree logic anyway, so Pkg.GitTools will go away at some point (it's kind of a motley collection of random stuff anyway). It might make sense to factor common git hashing code into a standalone package, but for now it's such a small amount of code that just duplicating it here and there seems ok. No one is actually using GitTools.tree_hash with any non-default hash type at this point, so it's fine to leave it as is.