Closed GunnarFarneback closed 4 years ago
Cool! My first very casual reaction is that it shares more code with extract than I’m happy with so I might want to figure out a common abstraction that both can be implemented in terms of. And for the gen_static.jl script in PkgServer.jl I just needed something that got access to each file as an I/O handle instead of as an extracted file, so I think there’s a commonality there: get each header with an I/O handle you can read from for each file—to extract, write it to a file; to hash, call a hash function on it. The only question is how to handle non-files, i.e. symlinks and directories.
Yeah, I intentionally started it as a copy of extract
to see how much needed to be replaced.
My first thought was to replace the predicate
function of extract
with an action
function. Feeding it with header
+ io
should be good, but you would probably want to wrap the io
in a view that can only see the current part of the file and yields eof
when you reach datalen
. In that case non-files would get an io
object that effectively works like an empty file.
API function and a test added. Questions:
GitTools
and should either be removed or completed.I spent some time refactoring the tarball reading logic so that the same core can support both extract_tarball
and hash_tree
, which I implemented here: https://github.com/JuliaIO/Tar.jl/pull/36. Sorry for sniping your PR! I also added a lot of tests and wrote the docs, so it's pretty complete at this point 😬
I added support for the classic git-sha1
algorithm and the upcoming git-sha256
algorithm, but I made the algorithm selection argument a string to eliminate the external coupling with the SHA package. That way support for non-git tree hashing algorithms could be added at some point.
Thank you for spurring me to implement this. Sorry I didn't use your version 😢
No problem. I'm just trying to understand why your SHA-256 doesn't give the same result as mine for non-empty tarballs.
Hmm. That's troubling. Do you have an example?
I'm looking at a tar archive containing a single file named "foo" with content "foo".
I was looking for a place where one of us forgets to pass an optional HashType argument, but I don't see that in either of our code (I made the HashType non-optional in all internal methods). I wonder if we can figure out how to use git as a reference here—put it in SHA256 mode.
Could it be that this part of the GitTools
implementation:
for (n, h, m) in entries
content_size += ndigits(UInt32(m); base=8) + 1 + sizeof(n) + 1 + 20
end
is incorrect for other hash types than the default? Specifically, should "20" rather be a function of the length of the hash?
That's definitely what differs. Changing 20 to 32 for SHA256 gives matching results. Would certainly be good to verify with git, but this looks unlikely not to be a GitTools
bug.
Ah yes! That would do it. That 20 is 20 bytes == 160 bits of SHA1 hash. My version doesn't have this issue because instead of computing the length like that, I just write the data to a buffer and take its length, which is much more foolproof. Doesn't really matter for performance either since the size of the buffer for directory info is never going to be large enough to matter. For file data, I optimized the hashing in https://github.com/JuliaIO/Tar.jl/pull/37, so that it does the hashing incrementally instead of writing everything to a buffer just to measure the length.
I'm working on a separate package that factors out all the content tree logic anyway, so Pkg.GitTools will go away at some point (it's kind of a motley collection of random stuff anyway). It might make sense to factor common git hashing code into a standalone package, but for now it's such a small amount of code that just duplicating it here and there seems ok. No one is actually using GitTools.tree_hash with any non-default hash type at this point, so it's fine to leave it as is.
Implements #19. Hashing works but testing and top level interface is missing.