Problem

In a world where CA derivations take over, this is a non-issue: the use of exclusively content-addressing store paths means that we are effectively constructing Merkle DAGs / doing deep content addressing for entire closures:
A single content-addressing store path verifies the entire closure of the object it refers to, because references (as store paths) affect the calculation, and those references are likewise content-addressing.
A resolved derivation (one that only has inputSrcs) likewise has a completely unambiguous input closure, allowing it to serve as a proper *shallow* trace key.
However, a "mixed store" with some content-addressed and some input-addressed store objects doesn't have any of these nice properties, because a single input-addressed reference "breaks" the transitive guarantees, meaning we don't really know anything about the input-addressed object or its closure (not even about the content-addressed objects in its closure).
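To make the transitivity point concrete, here is a minimal sketch of that check (StoreObject and closureFullyContentAddressed are illustrative names with stand-in types, not Nix's API):

#include <map>
#include <set>
#include <string>

// Stand-in types for illustration; real Nix types differ.
using StorePath = std::string;

struct StoreObject {
    bool contentAddressed;           // true iff this store path is a CA path
    std::set<StorePath> references;  // direct references
};

// A root store path pins down its whole closure only if every object in the
// closure is content-addressed; a single input-addressed reference anywhere
// below makes the answer false for the root too.
bool closureFullyContentAddressed(const std::map<StorePath, StoreObject>& store,
                                  const StorePath& root,
                                  std::set<StorePath>& visited) {
    if (!visited.insert(root).second)
        return true;  // already checked; also handles self-references
    const StoreObject& obj = store.at(root);
    if (!obj.contentAddressed)
        return false;  // one input-addressed object breaks the guarantee
    for (const StorePath& ref : obj.references)
        if (!closureFullyContentAddressed(store, ref, visited))
            return false;
    return true;
}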
Despite my slowness, I am not worried about the technical changes to Nix and Hydra that allow us to start using content-addressing derivations "for real". Rather, I am worried about software that contains pathological self-references, for which we'll have no choice but to continue using store paths that are fixed at build time (they could actually be input-addressed or just randomly generated, it doesn't matter) so as to avoid rewrites. I hope such software is rare/unimportant, but I don't know whether that will be the case.
Solution
Just because some store objects are "mounted" at non-content-addressed store paths doesn't mean we need to give up on content-addressing! The escape hatch is simply that we can use a content address in addition to a store path to lock down the store object's contents.
Indeed, we already do a version of this with "NAR hashes" --- we use those even when the object is input-addressed or content-addressed in a non-NAR way. It just happens that NAR hashes are not adequate for this task because they only track individual objects, not closures.
Imagine a "deep NAR hash" that is a combination of the NAR hash of the store object's own files and its references in the form of a map from store path to deep NAR hash.
This inductive structure gives us the Merkle hashing for whole-closure verification we want.

We update ValidPathInfo with the new deep NAR hash.
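Here is a minimal sketch of both pieces; deepNarHash, the digest placeholder, the new ValidPathInfo field, and the stand-in types are all assumptions for illustration, not the real Nix definitions:

#include <functional>
#include <map>
#include <optional>
#include <set>
#include <string>

// Stand-in types for illustration; the real Nix types are richer.
using StorePath = std::string;
using Hash = std::string;

// Placeholder digest; a real implementation would use e.g. SHA-256.
Hash digest(const std::string& s) {
    return std::to_string(std::hash<std::string>{}(s));
}

// Hypothetical: an object's deep NAR hash combines its own NAR hash with a
// map from each direct reference's store path to that reference's deep NAR
// hash. Applied recursively, the root hash verifies the entire closure.
Hash deepNarHash(const Hash& ownNarHash,
                 const std::map<StorePath, Hash>& refDeepHashes) {
    std::string preimage = ownNarHash;
    for (const auto& [path, h] : refDeepHashes)  // std::map iterates in sorted order,
        preimage += ";" + path + "=" + h;        // so the encoding is canonical
    return digest(preimage);
}

// Hypothetical extension of ValidPathInfo (abridged): keep the existing
// per-object NAR hash, and additionally record the deep NAR hash.
struct ValidPathInfo {
    StorePath path;
    Hash narHash;                     // existing: NAR hash of this object alone
    std::set<StorePath> references;   // existing: direct references
    std::optional<Hash> deepNarHash;  // new: pins down the entire closure
};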
For shallow traces of CA derivations we likewise want to make the inputs a type parameter on DerivationNew, so we have:
using BasicDerivation = DerivationNew<std::set<StorePath>>;
using Derivation = DerivationNew<std::pair<
    std::set<StorePath>,
    DerivedPathMap<std::set<OutputName>>
>>;
// new
using UnambiguousDerivation = DerivationNew<std::map<StorePath, DeepNARHash>>;
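For reference, a minimal sketch of what such an inputs-parameterized DerivationNew might look like --- the non-inputs fields follow the traditional derivation shape, but this layout is an assumption, not the actual definition:

#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in types for illustration.
using StorePath = std::string;
using OutputName = std::string;

// Hypothetical sketch of the inputs-parameterized derivation type assumed by
// the aliases above; only the inputs field varies between them.
template<typename Inputs>
struct DerivationNew {
    Inputs inputs;                 // the only part that varies between the aliases
    std::set<OutputName> outputs;  // simplified; real outputs carry more detail
    std::string platform;
    std::string builder;
    std::vector<std::string> args;
    std::map<std::string, std::string> env;
};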
This restores the properties we want --- they don't come from the store paths now (unless the store path happens to be content-addressed, in which case they still do), but instead from the new deep NAR hash. In addition, it vastly lowers the stakes for input-addressing.
We don't need to worry about the quality of input addressing / collisions, because accidental frankenbuilds are no longer possible. This is why it is fine to imagine we just randomly generated non-CA paths --- even if there were collisions, we would detect them when different actions wanted to "use" the same store path for different store objects.
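As a sketch of that detection (registerPath is a hypothetical helper, not Nix's actual logic): refuse to bind a store path to a second, different deep NAR hash.

#include <map>
#include <string>

// Stand-in types for illustration.
using StorePath = std::string;
using Hash = std::string;

// Hypothetical collision check: binding a (possibly randomly generated,
// non-CA) store path to a deep NAR hash. Two different store objects
// claiming the same path are caught because their deep NAR hashes differ.
bool registerPath(std::map<StorePath, Hash>& bindings,
                  const StorePath& path, const Hash& deepNarHash) {
    auto [it, inserted] = bindings.emplace(path, deepNarHash);
    return inserted || it->second == deepNarHash;  // false => collision detected
}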