Open MatthewCroughan opened 2 years ago
To copy the result of the investigation from #6479: the problem is that nix::emitTreeAttrs()
, in an attempt to add or normalize rev
, replaces whatever the user specifies with the result of Hash::gitRev()
(whether or not Git is used), which returns an unadorned hexadecimal string without any type prefix the user-specified revision hash might have had. This means the type of rev
is irrecoverably lost and nothing can parse it back anymore unless it’s (the assumed default of) SHA-1.
My initial suggestion was to replace the call to rev->gitRev()
in nix::emitTreeAttrs()
with a call to rev->to_string(Base16, rev->type != htSHA1)
(that is add a prefix unless the hash is SHA-1), but I’ve convinced myself since then that the proper way in a multiple-hash-functions world opened by #6376 would be to make Hash::gitRev()
itself add a prefix to non-SHA-1 hashes.
(Of course, the problem is that the upcoming SHA-256 support in Git almost certainly won’t, and the existing SHA3-256 support in Fossil definitely doesn’t, use a type prefix, instead relying on the length and tacitly assuming a base-16 encoding, so the name of Hash::gitRev()
will be something of a lie. But that discrepancy was already introduced in #6376, now we’re just dealing with the fallout. Besides, as long as Nix insists on interpreting revision hashes as actual Hash
es and not opaque strings, we’re guaranteed to have this problem: there is more than one hash type with a given length in the world. There’s even more than one 160-bit hash in the world, we’re just lucky that so far every VCS uses SHA-1 and none use RIPEMD-160 or whatnot.)
Besides, as long as Nix insists on interpreting revision hashes as actual Hashes and not opaque strings, we’re guaranteed to have this problem
That’s actually a good point: Is there a strong reason for Nix to interpret these? Very briefly glancing at the code it doesn’t seem like it’s the case, and it could probably make sense for these to just be opaque blobs
Isn't the reason not having to specify a second hash when a commit hash is available?
@toonn I think so yes. But I don’t think parsing the hash is required for that. The only thing that the code does, is that if the input has a rev
field then it will assume it to be “locked”, meaning that the fetcher can be executed even in pure mode. But that means that we could have the same thing without having to actually parse it
Doesn't checking the hash require at least parsing whether it's SHA1 or SHA256, etc., to even be able to do a string comparison?
@toonn As far as I can tell, among Git, Mercurial, Bazaar, Fossil, and Darcs, all either use a single 160-bit hash (SHA-1 or a patched version) or a single 160-bit hash and a single 256-bit hash (SHA-256 in Git, SHA3-256 in Fossil, Mercurial is discussing a BLAKE2 variant), always represent them as hexadecimal strings, and distinguish them by length on input. (Pijul has its own thing because it needs homomorphic hashing.)
More generally, I’d say the exact structure of the revision identifier concerns the VCS, not Nix: all Nix’s built-in fetchers (seem to) depend upon is that they can give the identifier string to the VCS and (barring failures) get a tree back, in a referentially transparent manner. How the VCS implements that mapping and how its developers defined the set of valid identifier strings does not matter.
The only exception I can see here is that Nix might want the revision identifier to be canonical in some sense. This is, to some degree, a nice-to-have—nothing prevents two different revision histories from yielding the same tree, in which case you have duplicates anyway—but I can see why you might still want that. In the current SHA-1-only world of Nix this amounts to always using lowercase hexadecimals, and I think that currently Nix will actively convert revisions to lowercase as a side (only?..) effect of the Hash::gitRev()
call in emitTreeAttrs()
.
If we’re seriously considering the possibility of a hypothetical future VCS using case-sensitive (or, less critically, preferred-uppercase) revisions, I see three choices:
rev
attribute from fetchTree
, changes in its case (very slight semantic change, may cause problems in conjunction with content-addressability?);emitTreeAttrs()
) is responsible for canonicalizing its own identifiers: nothing breaks, but hard to test with no current use cases.(Honestly I think whoever is currently passing manually-uppercased commit hashes to rev=
deserves what’s coming to them, but obviously that depends on how many such perverts there are.)
A final thing that this may expose is that the VCS-implemented mapping alluded to above is referentially transparent modulo failures when the full hash is being passed, but most places in the VCS’s UI will also accept hash prefixes in the same place in a non-referentially-transparent manner, so it’s possible for Nix to think it’s passing the full 40 characters of SHA-1 but for the VCS to think it’s receiving a 40-character prefix of the full 64 characters of $HASH-256. I can’t think of a scenario where that would matter that wouldn’t amount to a full break of at least one of the hashes, but IANAC, and any VCS fetcher that wants to deal with multiple hashes would need to be carefully audited to ensure it’s telling the VCS not to accept abbreviations.
I agree the structure/contents of the revision hashes are the VCS's business. It's just that Nix needs to understand them so it can check fetch results for Fixed Output Derivations.
@toonn I might be misunderstanding something very simple here, but how does Nix check the results? As far as I understand any such mechanism will boil down to “point VCS at result and ask it”, in which case we’re still comparing “string in rev
” to “string VCS returned”, no understanding of structure needed.
Or are you concerned about the VCS having multiple types of equally good identifiers for the same revision, so that when asking we’ll need to tell it which one we want? (Like commits having both a SHA-1 and a SHA-256.) That’s not the case for any of the VCSes mentioned so far (though depending on how the Git SHA-256 transition proceeds it might become a concern).
I'm probably the one misunderstanding things here. I guess I was considering the role of the VCS more like the role of cURL in fetchurl, an untrusted (potentially for external reasons) fetcher whose results need to be checked by manual hashing, hence the option of providing a hash rather than relying on the revision hash. But it does make sense to treat something like git
as equally trustworthy as sha1sum or whatever implementation of a particular hash function.
It's worth noting that revs aren't always strictly SHA1 compatible. Currently Nix is incompatible with short revs.
I think the best thing we can do is to treat the rev as a string and pass it on verbatim to Git (or whichever other SCM is in use). Being clever about it just gets in the way.
@adisbladis I fully agree, and I was reluctant at first to accept that. I thought Nix was being super untrusting, and that this was valid. But if you can't trust the SCM, then the SCM shouldn't be used at all, or included with mainline Nix. If that is ever the case, there should be debate on why the SCM is untrustworthy, non-reproducible, buggy, and not worth including.
@adisbladis Do you mean to imply Nix should accept short revs? (Otherwise I don’t really understand your point.) My understanding is that fetchTree
(and built-in fetching in general) is primarily there to support pure mode (impurely you could just shell out to curl or git or ...), and abbreviated hashes are not at all guaranteed to be unique. This would imply fetchTree
is right to only accept full ones.
@alexshpilkin Short revs are expanded to full revs by the revision control system. So whether or not they're reproducibly valid is up to the SCM. In the case of Git, giving it a colliding short rev will return error: short object ID abcd is ambiguous
. This is tested for in git
here https://github.com/git/git/blob/715d08a9e51251ad8290b181b6ac3b9e1f9719d7/t/t1512-rev-parse-disambiguation.sh. Preventing the usage of short revs in Nix is still probably a good idea, since it guarantees that the remote adding commits won't effect the reproducibility of the build. If we allow short revs, then a future collision added to the repository will cause nix expressions to stop working when they fetch the git tree again.
You could make the argument that no SCM should accept short revs.
Do you mean to imply Nix should accept short revs?
Yes, I do think short revs should be supported.
Preventing the usage of short revs in Nix is still probably a good idea
The problem is that we already have tooling (such as Go modules) which only record short revs, and these revs are now unusable in Nix.
You could make the argument that no SCM should accept short revs.
While I agree with this we're already way past that point.
Short revs should only be allowed in impure mode or in flake input specifications (not flake lock files), since it's not really hard to find collisions for short revs.
Short revs should only be allowed in impure mode or in flake input specifications (not flake lock files), since it's not really hard to find collisions for short revs.
That sounds like a good set of trade-offs.
Describe the bug
On the latest master, specifying
sha256:
as a prefix results in an unexpected error. It believes that my input sha256 should be a sha1, instead of a sha256.Expected behavior
It is expected that specifying
sha256:
as a prefix should allow a sha256 to be used, but instead it assumes the only valid input is a sha1, despite https://github.com/NixOS/nix/issues/6368 and https://github.com/NixOS/nix/pull/6376nix-env --version
outputnix-env (Nix) 2.9.0pre20220503_9489b4b