[fetch/git] Specifying hash prefix such as `sha256:` still fails and assumes sha1

MatthewCroughan commented 2 years ago

Describe the bug

On the latest master, specifying sha256: as a prefix results in an unexpected error. It believes that my input sha256 should be a sha1, instead of a sha256.

nix-repl> builtins.fetchTree "git+https://nixos/nixpkgs?rev=sha256:2a73956a200d4a02d600f9cdc271f316274ecd2af1cefdc88f982e4dcd6585"
error: hash 'sha256:2a73956a200d4a02d600f9cdc271f316274ecd2af1cefdc88f982e4dcd6585' should have type 'sha1'

nix-repl> builtins.fetchTree "git+https://nixos/nixpkgs?rev=2a73956a200d4a02d600f9cdc271f316274ecd2af1cefdc88f982e4dcd6585"        
error: hash '2a73956a200d4a02d600f9cdc271f316274ecd2af1cefdc88f982e4dcd6585' has wrong length for hash type 'sha1'

Expected behavior

It is expected that specifying sha256: as a prefix should allow a sha256 to be used, but instead it assumes the only valid input is a sha1, despite https://github.com/NixOS/nix/issues/6368 and https://github.com/NixOS/nix/pull/6376

nix-env --version output nix-env (Nix) 2.9.0pre20220503_9489b4b

alexshpilkin commented 2 years ago

To copy the result of the investigation from #6479: the problem is that nix::emitTreeAttrs(), in an attempt to add or normalize rev, replaces whatever the user specifies with the result of Hash::gitRev() (whether or not Git is used), which returns an unadorned hexadecimal string without any type prefix the user-specified revision hash might have had. This means the type of rev is irrecoverably lost and nothing can parse it back anymore unless it’s (the assumed default of) SHA-1.

My initial suggestion was to replace the call to rev->gitRev() in nix::emitTreeAttrs() with a call to rev->to_string(Base16, rev->type != htSHA1) (that is add a prefix unless the hash is SHA-1), but I’ve convinced myself since then that the proper way in a multiple-hash-functions world opened by #6376 would be to make Hash::gitRev() itself add a prefix to non-SHA-1 hashes.

(Of course, the problem is that the upcoming SHA-256 support in Git almost certainly won’t, and the existing SHA3-256 support in Fossil definitely doesn’t, use a type prefix, instead relying on the length and tacitly assuming a base-16 encoding, so the name of Hash::gitRev() will be something of a lie. But that discrepancy was already introduced in #6376, now we’re just dealing with the fallout. Besides, as long as Nix insists on interpreting revision hashes as actual Hashes and not opaque strings, we’re guaranteed to have this problem: there is more than one hash type with a given length in the world. There’s even more than one 160-bit hash in the world, we’re just lucky that so far every VCS uses SHA-1 and none use RIPEMD-160 or whatnot.)

thufschmitt commented 2 years ago

Besides, as long as Nix insists on interpreting revision hashes as actual Hashes and not opaque strings, we’re guaranteed to have this problem

That’s actually a good point: Is there a strong reason for Nix to interpret these? Very briefly glancing at the code it doesn’t seem like it’s the case, and it could probably make sense for these to just be opaque blobs

toonn commented 2 years ago

Isn't the reason not having to specify a second hash when a commit hash is available?

thufschmitt commented 2 years ago

@toonn I think so yes. But I don’t think parsing the hash is required for that. The only thing that the code does, is that if the input has a rev field then it will assume it to be “locked”, meaning that the fetcher can be executed even in pure mode. But that means that we could have the same thing without having to actually parse it

toonn commented 2 years ago

Doesn't checking the hash require at least parsing whether it's SHA1 or SHA256, etc., to even be able to do a string comparison?

alexshpilkin commented 2 years ago

@toonn As far as I can tell, among Git, Mercurial, Bazaar, Fossil, and Darcs, all either use a single 160-bit hash (SHA-1 or a patched version) or a single 160-bit hash and a single 256-bit hash (SHA-256 in Git, SHA3-256 in Fossil, Mercurial is discussing a BLAKE2 variant), always represent them as hexadecimal strings, and distinguish them by length on input. (Pijul has its own thing because it needs homomorphic hashing.)

More generally, I’d say the exact structure of the revision identifier concerns the VCS, not Nix: all Nix’s built-in fetchers (seem to) depend upon is that they can give the identifier string to the VCS and (barring failures) get a tree back, in a referentially transparent manner. How the VCS implements that mapping and how its developers defined the set of valid identifier strings does not matter.

The only exception I can see here is that Nix might want the revision identifier to be canonical in some sense. This is, to some degree, a nice-to-have—nothing prevents two different revision histories from yielding the same tree, in which case you have duplicates anyway—but I can see why you might still want that. In the current SHA-1-only world of Nix this amounts to always using lowercase hexadecimals, and I think that currently Nix will actively convert revisions to lowercase as a side (only?..) effect of the Hash::gitRev() call in emitTreeAttrs().

If we’re seriously considering the possibility of a hypothetical future VCS using case-sensitive (or, less critically, preferred-uppercase) revisions, I see three choices:

Remove the case mapping: people currently using non-canonical (uppercase) identifiers may see some duplicated work (no semantic change I think?) and, if they also use the rev attribute from fetchTree, changes in its case (very slight semantic change, may cause problems in conjunction with content-addressability?);
Remove the case mapping and have every VCS fetcher check canonicity alongside other existing checks (my preference): people currently using non-canonical identifiers will have their derivations break (loud semantic change, but no silent breakage);
Rearchitect the current code so that every VCS fetcher (and not emitTreeAttrs()) is responsible for canonicalizing its own identifiers: nothing breaks, but hard to test with no current use cases.

(Honestly I think whoever is currently passing manually-uppercased commit hashes to rev= deserves what’s coming to them, but obviously that depends on how many such perverts there are.)

A final thing that this may expose is that the VCS-implemented mapping alluded to above is referentially transparent modulo failures when the full hash is being passed, but most places in the VCS’s UI will also accept hash prefixes in the same place in a non-referentially-transparent manner, so it’s possible for Nix to think it’s passing the full 40 characters of SHA-1 but for the VCS to think it’s receiving a 40-character prefix of the full 64 characters of $HASH-256. I can’t think of a scenario where that would matter that wouldn’t amount to a full break of at least one of the hashes, but IANAC, and any VCS fetcher that wants to deal with multiple hashes would need to be carefully audited to ensure it’s telling the VCS not to accept abbreviations.

toonn commented 2 years ago

I agree the structure/contents of the revision hashes are the VCS's business. It's just that Nix needs to understand them so it can check fetch results for Fixed Output Derivations.

alexshpilkin commented 2 years ago

@toonn I might be misunderstanding something very simple here, but how does Nix check the results? As far as I understand any such mechanism will boil down to “point VCS at result and ask it”, in which case we’re still comparing “string in rev” to “string VCS returned”, no understanding of structure needed.

Or are you concerned about the VCS having multiple types of equally good identifiers for the same revision, so that when asking we’ll need to tell it which one we want? (Like commits having both a SHA-1 and a SHA-256.) That’s not the case for any of the VCSes mentioned so far (though depending on how the Git SHA-256 transition proceeds it might become a concern).

toonn commented 2 years ago

I'm probably the one misunderstanding things here. I guess I was considering the role of the VCS more like the role of cURL in fetchurl, an untrusted (potentially for external reasons) fetcher whose results need to be checked by manual hashing, hence the option of providing a hash rather than relying on the revision hash. But it does make sense to treat something like git as equally trustworthy as sha1sum or whatever implementation of a particular hash function.

adisbladis commented 2 years ago

It's worth noting that revs aren't always strictly SHA1 compatible. Currently Nix is incompatible with short revs.

I think the best thing we can do is to treat the rev as a string and pass it on verbatim to Git (or whichever other SCM is in use). Being clever about it just gets in the way.

MatthewCroughan commented 2 years ago

@adisbladis I fully agree, and I was reluctant at first to accept that. I thought Nix was being super untrusting, and that this was valid. But if you can't trust the SCM, then the SCM shouldn't be used at all, or included with mainline Nix. If that is ever the case, there should be debate on why the SCM is untrustworthy, non-reproducible, buggy, and not worth including.

alexshpilkin commented 2 years ago

@adisbladis Do you mean to imply Nix should accept short revs? (Otherwise I don’t really understand your point.) My understanding is that fetchTree (and built-in fetching in general) is primarily there to support pure mode (impurely you could just shell out to curl or git or ...), and abbreviated hashes are not at all guaranteed to be unique. This would imply fetchTree is right to only accept full ones.

MatthewCroughan commented 2 years ago

@alexshpilkin Short revs are expanded to full revs by the revision control system. So whether or not they're reproducibly valid is up to the SCM. In the case of Git, giving it a colliding short rev will return error: short object ID abcd is ambiguous. This is tested for in git here https://github.com/git/git/blob/715d08a9e51251ad8290b181b6ac3b9e1f9719d7/t/t1512-rev-parse-disambiguation.sh. Preventing the usage of short revs in Nix is still probably a good idea, since it guarantees that the remote adding commits won't effect the reproducibility of the build. If we allow short revs, then a future collision added to the repository will cause nix expressions to stop working when they fetch the git tree again.

You could make the argument that no SCM should accept short revs.

adisbladis commented 2 years ago

Do you mean to imply Nix should accept short revs?

Yes, I do think short revs should be supported.

Preventing the usage of short revs in Nix is still probably a good idea

The problem is that we already have tooling (such as Go modules) which only record short revs, and these revs are now unusable in Nix.

You could make the argument that no SCM should accept short revs.

While I agree with this we're already way past that point.

edolstra commented 2 years ago

Short revs should only be allowed in impure mode or in flake input specifications (not flake lock files), since it's not really hard to find collisions for short revs.

adisbladis commented 2 years ago

Short revs should only be allowed in impure mode or in flake input specifications (not flake lock files), since it's not really hard to find collisions for short revs.

That sounds like a good set of trade-offs.

NixOS / nix

[fetch/git] Specifying hash prefix such as `sha256:` still fails and assumes sha1 #6496