Merkle verification of non-content-addressed data

Problem

In a world where CA derivations take over, this is a non-issue: the exclusive use of content-addressing store paths means that we are effectively constructing Merkle DAGs / doing deep content addressing for entire closures:

A single content-addressing store path verifies the entire closure of the object it refers to, because references (as store paths) affect the calculation, and those are likewise content-addressing.
A resolved derivation (one that only has inputSrcs) likewise has a completely unambiguous input closure, allowing it to serve as a properly shallow trace key.

However, a "mixed store" with some content-addressed and some input-addressed store objects doesn't have any of these nice properties, because a single input-addressed reference "breaks" the transitive guarantees, meaning we don't really know anything about the input-addressed object or its closure (not even about the content-addressed objects in that closure).

Despite my slowness, I am not worried about the technical changes to Nix and Hydra that allow us to start using content-addressing derivations "for real". Rather, I am worried about software that contains pathological self-references, for which we'll have no choice but to continue using store paths that are fixed at build time (they could be input-addressed or just randomly generated, it doesn't matter) so as to avoid rewrites. I hope such software is rare/unimportant, but I don't know whether that will be the case.

Solution

Just because some store objects are "mounted" at non-content-addressed store paths doesn't mean we need to give up on content-addressing! The escape hatch is simple: we can use a content address in addition to a store path to lock down the store object's contents.

Indeed, we already do a version of this with "NAR hashes" --- we use those even when the object is input-addressed or content-addressed in a non-NAR way. It just happens that NAR hashes are not adequate for this task, because they only track individual objects, not closures.

Imagine a "deep NAR hash" that is a combination of the NAR hash of the store object's own files and the object's references, where the references are given as a map from "store path" to "deep NAR hash".

struct DeepNARHash {
    Hash h;

    // Combines the object's own NAR hash with the deep NAR hashes of its references.
    static DeepNARHash calc(Hash narHash, const std::map<StorePath, DeepNARHash> & references);
};

This inductive structure gives us the Merkle hashing we want for whole-closure verification.
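To make the induction concrete, here is a minimal sketch of how calc could fold the references into the hash. Everything in it is an assumption for illustration: StorePath and Hash are simplified stand-ins rather than Nix's real types, the mix function is a non-cryptographic FNV-1a style placeholder where a real implementation would use something like SHA-256, and the exact serialization (prefixes, length-prefixing) is invented here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for Nix's StorePath and Hash types.
using StorePath = std::string;
struct Hash {
    std::uint64_t value;
    bool operator==(const Hash & other) const { return value == other.value; }
};

// FNV-1a style mixing: a NON-cryptographic placeholder for a real hash such as SHA-256.
inline std::uint64_t mix(std::uint64_t state, const std::string & data)
{
    for (unsigned char c : data) {
        state ^= c;
        state *= 1099511628211ULL;
    }
    return state;
}

struct DeepNARHash {
    Hash h;
    static DeepNARHash calc(Hash narHash, const std::map<StorePath, DeepNARHash> & references);
};

inline DeepNARHash DeepNARHash::calc(Hash narHash, const std::map<StorePath, DeepNARHash> & references)
{
    // Base case: an object without references is identified by its NAR hash alone.
    std::uint64_t state = mix(14695981039346656037ULL, "nar:" + std::to_string(narHash.value));
    // Inductive case: fold in each (store path, deep hash) pair.
    // std::map iterates in key order, so insertion order does not matter.
    for (const auto & ref : references) {
        // Length-prefix the path so field boundaries stay unambiguous.
        state = mix(state, "|ref:" + std::to_string(ref.first.size()) + ":" + ref.first);
        state = mix(state, "|deep:" + std::to_string(ref.second.h.value));
    }
    return DeepNARHash{Hash{state}};
}

// Usage: the same store path with different contents changes the dependent's deep hash.
inline void demoDeepHash()
{
    const std::map<StorePath, DeepNARHash> none;
    DeepNARHash depV1 = DeepNARHash::calc(Hash{1}, none);
    DeepNARHash depV2 = DeepNARHash::calc(Hash{2}, none);
    std::map<StorePath, DeepNARHash> refsV1, refsV2;
    refsV1.emplace("/nix/store/hypothetical-dep", depV1);
    refsV2.emplace("/nix/store/hypothetical-dep", depV2);
    assert(!(DeepNARHash::calc(Hash{10}, refsV1).h == DeepNARHash::calc(Hash{10}, refsV2).h));
}
```

The point of the demo is that the dependency keeps its (non-content-addressed) store path in both cases, yet the dependent object's deep hash still changes. The sketch assumes an acyclic reference graph; self-references would need separate treatment.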

We update ValidPathInfo with

- std::set<StorePath> references;
+ std::map<StorePath, DeepNARHash> references;
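With references carrying deep hashes, a whole closure can be checked top-down from a single trusted deep hash. The following is a rough sketch under stated assumptions: PathInfo is a hypothetical two-field reduction of ValidPathInfo, calcDeep and verifyClosure are names invented here, the hash is a non-cryptographic placeholder, and the reference graph is assumed to be acyclic. It only checks metadata consistency; a real check would also recompute each narHash from the files on disk.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for Nix's types; not the real definitions.
using StorePath = std::string;
using Hash = std::uint64_t;

struct DeepNARHash {
    Hash h;
    bool operator==(const DeepNARHash & o) const { return h == o.h; }
};

// The proposed ValidPathInfo shape, reduced to the two relevant fields.
struct PathInfo {
    Hash narHash;
    std::map<StorePath, DeepNARHash> references;
};

// FNV-1a style mixing: a NON-cryptographic placeholder for a real hash.
inline Hash mix(Hash state, const std::string & data)
{
    for (unsigned char c : data) {
        state ^= c;
        state *= 1099511628211ULL;
    }
    return state;
}

// Recomputes an object's deep hash from its own NAR hash and its references map.
inline DeepNARHash calcDeep(const PathInfo & info)
{
    Hash state = mix(14695981039346656037ULL, "nar:" + std::to_string(info.narHash));
    for (const auto & ref : info.references) {
        state = mix(state, "|ref:" + std::to_string(ref.first.size()) + ":" + ref.first);
        state = mix(state, "|deep:" + std::to_string(ref.second.h));
    }
    return DeepNARHash{state};
}

// Returns true if `path` and everything reachable from it match `expected`.
inline bool verifyClosure(const std::map<StorePath, PathInfo> & store,
                          const StorePath & path, DeepNARHash expected)
{
    auto it = store.find(path);
    if (it == store.end()) return false;                    // object is missing
    if (!(calcDeep(it->second) == expected)) return false;  // this object does not match
    for (const auto & ref : it->second.references)
        if (!verifyClosure(store, ref.first, ref.second)) return false;
    return true;
}

// Usage: swapping out a dependency under the same store path is detected.
inline void demoVerifyClosure()
{
    std::map<StorePath, PathInfo> store;
    store["/nix/store/hypothetical-dep"] = PathInfo{111, {}};
    PathInfo top{222, {}};
    top.references.emplace("/nix/store/hypothetical-dep", calcDeep(store["/nix/store/hypothetical-dep"]));
    store["/nix/store/hypothetical-top"] = top;
    assert(verifyClosure(store, "/nix/store/hypothetical-top", calcDeep(top)));
    store["/nix/store/hypothetical-dep"].narHash = 999; // different contents, same path
    assert(!verifyClosure(store, "/nix/store/hypothetical-top", calcDeep(top)));
}
```

Note that the top-level object's own record is untouched in the failing case; the mismatch is found one level down, which is exactly the transitive guarantee a plain set of store paths cannot give.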

For shallow traces of CA derivations, we likewise want to make the inputs a type parameter of DerivationNew, so we have


using BasicDerivation = DerivationNew<std::set<StorePath>>;

using Derivation = DerivationNew<std::pair<
    std::set<StorePath>,
    DerivedPathMap<std::set<OutputName>>
>>;

// new
using UnambiguousDerivation = DerivationNew<std::map<StorePath, DeepNARHash>>;

This restores the properties we want --- they no longer come from the store paths (unless a store path happens to be content-addressing, in which case they still do), but instead from the new deep NAR hash. In addition, it vastly lowers the stakes for input-addressing.

We don't need to worry about the quality of input addressing / collisions, because accidental frankenbuilds are no longer possible. This is why it is fine to imagine that we just randomly generate non-CA paths --- even if there were collisions, we would detect them when different actions wanted to "use" the same store path for different store objects.
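The collision detection described above could look roughly like the following sketch. PathRegistry, ClaimResult, and claim are hypothetical names introduced here, and DeepNARHash is reduced to an opaque integer wrapper; nothing in it is existing Nix API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins; DeepNARHash is treated as an opaque value here.
using StorePath = std::string;
struct DeepNARHash {
    std::uint64_t h;
    bool operator==(const DeepNARHash & o) const { return h == o.h; }
};

enum class ClaimResult { Registered, AlreadyPresent, Collision };

// Tracks which store object (by deep NAR hash) each store path is bound to.
class PathRegistry {
    std::map<StorePath, DeepNARHash> known;

public:
    ClaimResult claim(const StorePath & path, DeepNARHash deep)
    {
        auto it = known.find(path);
        if (it == known.end()) {
            known.emplace(path, deep);
            return ClaimResult::Registered;
        }
        // Same path and same object is harmless; same path and a different object is a collision.
        return it->second == deep ? ClaimResult::AlreadyPresent : ClaimResult::Collision;
    }
};

// Usage: a second action trying to bind the same path to a different object is rejected.
inline void demoRegistry()
{
    PathRegistry registry;
    assert(registry.claim("/nix/store/hypothetical-random-path", DeepNARHash{1}) == ClaimResult::Registered);
    assert(registry.claim("/nix/store/hypothetical-random-path", DeepNARHash{1}) == ClaimResult::AlreadyPresent);
    assert(registry.claim("/nix/store/hypothetical-random-path", DeepNARHash{2}) == ClaimResult::Collision);
}
```

Because the deep hash, not the path, is what identifies the object, a collision becomes a detectable error at the moment of use rather than a silent frankenbuild.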

NixOS / nix

Merkle verification of non-content-addressed data #11919
