ebkalderon / merkle-tree-nix-store-thing

Basic implementation of a Nix-like store abstraction over a Merkle tree
Apache License 2.0

Support hashing and serializing directory paths into store #3

Open ebkalderon opened 3 years ago

ebkalderon commented 3 years ago

Before we can implement build specs as described in #2, we first need to add a method of hashing and serializing directory trees to the store. In particular, we would like the ability to traverse an arbitrary directory path (built locally) and create a content-addressable tree and package object pair.

The basic algorithm would be as follows (a rough sketch in code follows the list):

  1. Enter the directory, looping over every DirEntry at each level down.
  2. Hash every blob you find, taking self-references into account (more on that later). Insert into store, if not present, and create a tree entry. Record any explicit references to the store path you may find.
  3. Convert every symlink into a tree entry, taking self-references into account. Record any explicit references to the store path you may find.
  4. Recurse into every subdirectory. Insert the resulting subtree into the store, if not present, and create a tree entry.
  5. Assemble all subtrees, blobs, and symlinks into one tree object and insert it into the store, if not present.
  6. Construct a new package object from the tree object, path references, package name, and target platform. Insert it into the store, if not present.
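
A minimal sketch of steps 1–5, assuming hypothetical `Store`, `ObjectId`, and `Entry` types (the real implementation will differ in its details, and the self-reference handling from steps 2–3 is omitted here):

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

type ObjectId = [u8; 32]; // hypothetical 32-byte content hash

enum Entry {
    Blob { id: ObjectId, executable: bool },
    Symlink(PathBuf),
    Tree(ObjectId),
}

// Hypothetical content-addressable store interface (steps 2, 4, and 5).
trait Store {
    fn insert_blob(&mut self, data: Vec<u8>) -> io::Result<ObjectId>;
    fn insert_tree(&mut self, entries: BTreeMap<String, Entry>) -> io::Result<ObjectId>;
}

/// Step 1: walk `dir` recursively, turning each level into a tree object
/// and returning the root tree's ID.
fn commit_dir<S: Store>(store: &mut S, dir: &Path) -> io::Result<ObjectId> {
    let mut entries = BTreeMap::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().into_string().expect("non-UTF-8 file name");
        let file_type = entry.file_type()?; // does not follow symlinks
        let child = if file_type.is_symlink() {
            // Step 3: record the target verbatim; self-reference detection
            // and store path reference scanning would hook in here.
            Entry::Symlink(fs::read_link(entry.path())?)
        } else if file_type.is_dir() {
            // Step 4: recurse, inserting the subtree if not present.
            Entry::Tree(commit_dir(store, &entry.path())?)
        } else {
            // Step 2: hash the blob and insert it if not present.
            let executable = is_executable(&entry.path())?;
            let id = store.insert_blob(fs::read(entry.path())?)?;
            Entry::Blob { id, executable }
        };
        entries.insert(name, child);
    }
    // Step 5: assemble this level's entries into a single tree object.
    store.insert_tree(entries)
}

#[cfg(unix)]
fn is_executable(path: &Path) -> io::Result<bool> {
    use std::os::unix::fs::PermissionsExt;
    Ok(fs::metadata(path)?.permissions().mode() & 0o111 != 0)
}
```

Keying the entries by file name in a `BTreeMap` keeps the tree serialization deterministic, which is what makes the resulting hashes reproducible.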

Self-references

The subject of self-references is a tricky one. In particular, because the package object's hash is derived from its contents on disk, self-references are impossible to support. Computing package hashes modulo self-references and rewriting them after the fact, as suggested in https://github.com/NixOS/rfcs/pull/17, would not work in our Merkle tree design because it creates a chicken-and-egg problem where the on-disk content no longer exactly matches its content hash.

I believe this can be remedied using relative paths together with the $ORIGIN (Linux) and @loader_path (macOS) special prefixes, patched in after the fact, similar to how Spack does things. This is tricky to pull off reliably, as described in Strategies for Binary Relocation in Functional Build Systems by Max McDonnell (an article that will also be useful for implementing binary relocation later), but the author of Spack responded to that post clarifying a few things, and I believe the approach should work well enough to eliminate the problem of self-references. In summary, I believe we should:

  1. Use absolute paths (like Nix) for everything whenever possible.
  2. Use relative paths and $ORIGIN/@loader_path for self-references (sketched below). Rewriting hashes ex post facto just won't cut it.
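
As a sketch of point 2 for the symlink case, rewriting an absolute self-referential target into a relative one might look like this (function name hypothetical; for ELF binaries the analogous fix would be an $ORIGIN-relative RPATH entry, patched with a tool like patchelf):

```rust
use std::path::{Path, PathBuf};

/// If an absolute symlink `target` points back inside `pkg_root`, rewrite
/// it relative to the symlink's own parent directory; otherwise return it
/// unchanged. `link` is the absolute path of the symlink itself.
fn relativize_self_reference(pkg_root: &Path, link: &Path, target: &Path) -> PathBuf {
    let (suffix, parent) = match (target.strip_prefix(pkg_root), link.parent()) {
        (Ok(suffix), Some(parent)) => (suffix, parent),
        _ => return target.to_path_buf(), // not a self-reference
    };
    // Climb from the symlink's directory up to the package root...
    let ups = parent.strip_prefix(pkg_root).map_or(0, |p| p.components().count());
    let mut relative = PathBuf::new();
    for _ in 0..ups {
        relative.push("..");
    }
    // ...then descend to the original target.
    relative.push(suffix);
    relative
}
```
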
ebkalderon commented 3 years ago

The basic algorithm is implemented in 0ff0472353e23b3a2b49b4e9cc1e2d7078f5b44f via the addition of a private install_package() method, but self-reference detection and patching are currently only available for symlinks (and still untested). Support for blob objects is soon to follow, along with some practical tests.

wmertens commented 3 years ago

> Computing package hashes modulo self-references and rewriting them after the fact, as suggested in NixOS/rfcs#17, would not work in our Merkle tree design because it creates a chicken-and-egg problem where the on-disk content no longer exactly matches its content hash.

Could you elaborate? Is it because (as I understand this) each subtree is stored separately and has its own hash? But wouldn't self-references need to know the hashes of parent directories, so you wouldn't even be able to create those references in the first place?

I agree that $ORIGIN should be used as much as possible. On macOS it's pretty much the default.

ebkalderon commented 3 years ago

Wow, I'm surprised to see you commenting on my toy project, @wmertens! It is an honor. :smile: Thanks for all your great contributions to Nix.

There is nothing preventing self-references from being created inside the build directory and copied into the output directory. The real challenge is what happens afterwards, when the $out directory (in Nix terms) must be recursively serialized as Merkle tree objects. The proposed strategy for dealing with this, as mentioned in the issue description, would be to scan for paths in blobs and symlinks containing the $out hash, and patch those out as relative paths before inserting them into the store. This would maintain the content-addressability invariant.

However, I'm well aware that relative paths introduce problems of their own and that some applications don't deal with them very well. I would still like to see how far I can take this idea, though, and I'm currently looking at adapting Spack's relative-path binary relocation work to see how much wisdom could be applied here.

ebkalderon commented 3 years ago

After a bit more thought, I think the path rewriting that the Nix RFC proposes might actually work better under a Merkle tree store model than I had originally anticipated. The key is to include with each Package object an optional list of blob IDs which are known to contain self-references.

Hashing modulo self-references

When blobs are being hashed and serialized into the store, any detected path self-references in their contents are patched out to fixed values before being added to the store, e.g.:

/path/to/store/packages/.staging/hello-1.0.0-0000000000000000000000000000000000000000000000000000000000000000/

could be rewritten to:

/path/to/store/packages/0000000000000000000000000000000000000000000000000000000000000000000000000000000000000/

Once the patching is complete, the blob is recorded in a BTreeSet<ObjectId> in the Package object, marking it as containing self-references.
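
A minimal sketch of this store-time patching, assuming the staging path is matched as raw bytes and overwritten in place with an equal-length run of ASCII '0' characters (names hypothetical):

```rust
use std::collections::BTreeSet;

/// Store-time patching: replace every occurrence of the staging `$out`
/// path in `data` with an equal-length run of ASCII zeros, returning the
/// byte offset of each occurrence. Equal-length replacement is what keeps
/// every other offset in the blob stable.
fn patch_self_references(data: &mut [u8], staging_path: &[u8]) -> BTreeSet<u64> {
    let mut offsets = BTreeSet::new();
    let mut from = 0;
    while let Some(pos) = find(&data[from..], staging_path) {
        let at = from + pos;
        data[at..at + staging_path.len()].fill(b'0');
        offsets.insert(at as u64);
        from = at + staging_path.len();
    }
    offsets
}

/// Naive substring search; a real implementation would use memchr or
/// similar for speed.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

A blob for which this returns a non-empty set would then have its ID added to the Package object's `BTreeSet<ObjectId>`.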

Installation with path rewriting

During package installation, every time write_tree() encounters an object ID known to contain self-references, instead of hard-linking the object to the destination as usual, it copies the object over with the self-references patched to their final values, like this:

/path/to/store/packages/hello-1.0.0-c17cb4d06cb51d69238b70e45766e9b265c7d70cb5c23e510ce2a940610c3e64//////////

Possible improvements

We could potentially replace the BTreeSet<ObjectId> with a BTreeMap<ObjectId, BTreeSet<u64>> which records the byte offsets of every /000...0/ sequence that needs to be replaced. This would let us avoid scanning entire files linearly to find self-references.
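
With per-blob offsets on hand (whether from the BTreeMap above, or recorded at patch time as in the earlier sketch), the install-time rewrite reduces to a straight splice; a rough sketch, with names hypothetical:

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Install-time rewrite: copy a stored blob to `dest`, splicing the final
/// package path over the zeroed-out placeholder at each recorded offset.
/// `final_path` must be exactly as long as the placeholder it replaces.
fn install_patched(
    src: &Path,
    dest: &Path,
    offsets: &BTreeSet<u64>,
    final_path: &[u8],
) -> io::Result<()> {
    let mut data = fs::read(src)?;
    for &offset in offsets {
        let at = offset as usize;
        data[at..at + final_path.len()].copy_from_slice(final_path);
    }
    fs::write(dest, data)
}
```
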

wmertens commented 3 years ago

Actually, I think you can do it at store time. There's only one self-reference per package, and it refers to the root of the Merkle tree. So if you adjust the checksum algorithm to replace self-references with placeholders, you can calculate the checksum for the root, replace it in all blobs, and you have a self-referential tree! The only difference with the RFC is that the self-reference can't be deduced from the blob hash, so you need to keep the list of self-references and offsets so the hashing algorithm knows what to replace.

ebkalderon commented 3 years ago

That's certainly an interesting prospect, @wmertens. It would make the installation process much simpler for sure, and it would allow us to continue leveraging hard links to the objects directory. But it would also break the invariant that every file in the objects directory is named strictly after its real content hash, i.e. that every stored object can be hashed on its own, with no additional inputs, to validate the Merkle tree's integrity.

In contrast, there is no such strict requirement on the packages directory, because it only contains on-disk serializations of packages, whose contents are technically free to deviate (as necessary) from the CAS objects they were constructed from. This is a property I had exploited in the previously proposed strategy: by having self-references patched out at store time and replaced with final values only at installation time, the objects directory remains content-addressable without compromise, while the contents of the packages directory would be allowed to vary.

What do you think about this, though? Are my concerns about a non-CA objects directory actually overblown?

wmertens commented 3 years ago

By having a separate installation step, you double storage requirements for each root-referencing file :confused: OTOH, writing the root-reference at install time means that e.g. man pages will be deduplicated across versions. But then you could take it one step further and write all store references at installation time. That way, identical libraries that only use different dependencies would be fully deduplicated. :thinking: interesting...

And in the packages directory, you can still apply the $cas naming of the RFC, which would still allow sharing the entire store between hosts, both objects and packages. The Merkle tree would then serve as a backing store where you can do upgrades by only fetching the additional files and the references map.

But, it does still mean duplicating storage for any file with references, which is a problem for low-resource systems. So then on those systems you would want to only have the packages directory, where the situation is then identical to the RFC.

This can be solved by a special filesystem that does the patching in realtime, but that has its own set of problems and isn't possible on all platforms.

ebkalderon commented 3 years ago

Those are some very valid points, thanks! I agree that the increased storage overhead for every root-referencing file is a notable drawback, although I would add that the more near-identical versions of a package exist in the store, the smaller this overhead becomes relative to everything that still deduplicates. It would be interesting to see some actual numbers on this, especially with regard to how common self-references are in real-world package sets (recall that ELF/Mach-O RPATHs and symlink targets are expected to be normalized to relative paths at store time, so those won't contribute to this overhead), and whether this additional cost is so egregious in practice that it merits abandoning the approach altogether.

I would also like to point out that patching blobs after the fact would improve network transfer efficiency (only one base object to transfer over the wire for n package versions, instead of transferring n nearly-identical objects for n versions). Also, the ability for Merkle trees to self-validate purely by their on-disk content, using a strictly content-addressable objects directory, is still an important property not to be taken lightly.

Ultimately, these tradeoffs will need to be weighed carefully against each other to select the best approach. :thinking:

wmertens commented 3 years ago

To check real-world numbers, you can take the current .links directory of an optimized store, and pipe each file through a filter that replaces the store references with nulls and then calculates the hash. Any duplicate hashes would be deduplicated by the Merkle tree. Any files without store references would not result in duplicate storage.
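
A rough harness for this measurement might look like the following sketch, which assumes the conventional /nix/store/ path prefix and 32-character hash part, and substitutes a dependency-free FNV-1a hash where a real run would use SHA-256:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Walk an optimized store's `.links` directory, null out anything that
/// looks like a store reference, hash the result, and count how many
/// distinct files collapse together "modulo references".
fn measure_dedup(links_dir: &Path) -> io::Result<()> {
    const PREFIX: &[u8] = b"/nix/store/";
    let mut buckets: HashMap<u64, u32> = HashMap::new();
    let (mut with_refs, mut total) = (0u64, 0u64);
    for entry in fs::read_dir(links_dir)? {
        let mut data = fs::read(entry?.path())?;
        let mut found = false;
        let mut i = 0;
        while i + PREFIX.len() <= data.len() {
            if &data[i..i + PREFIX.len()] == PREFIX {
                // Null the prefix plus the 32-character hash that follows.
                let end = (i + PREFIX.len() + 32).min(data.len());
                data[i..end].fill(0);
                found = true;
                i = end;
            } else {
                i += 1;
            }
        }
        total += 1;
        with_refs += found as u64;
        *buckets.entry(fnv1a(&data)).or_insert(0) += 1;
    }
    println!(
        "{total} files, {with_refs} contain refs, {} distinct modulo refs",
        buckets.len()
    );
    Ok(())
}

fn fnv1a(data: &[u8]) -> u64 {
    data.iter()
        .fold(0xcbf29ce484222325, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
}
```
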

ebkalderon commented 3 years ago

After stepping back from data collection and focusing on other things for a while, I think that patching self-references at store time doesn't work because it produces hash collisions in the Merkle tree as soon as multiple versions of the same package exist. Two blobs that are identical apart from their self-references would produce the same hash, which could result in the wrong files being selected when instantiating other versions of the same package. Here is an example of such a scenario:

  1. Let package foo-1.0.0-fd53fe2392dc260e9cf414a39aeb43641c10ab48a726c58e76d06a7fe443d660 be committed to the store, with blob $out/bin/foo being hashed modulo its self-references, patched, and inserted into the store.
  2. foo-1.0.0-fd53fe2392dc260e9cf414a39aeb43641c10ab48a726c58e76d06a7fe443d660 is instantiated from the Merkle tree in the packages directory. So far, everything is working as expected.
  3. Let package foo-1.0.0-066d344ef7a60d67e85c627a84ba01c14634bea93a414fc9e062cd2932ef35df be committed to the store. The blob $out/bin/foo is virtually identical to the other package instance, but uses a different path for its self-references. When the blob is hashed modulo its self-references, it produces the same hash as the $out/bin/foo of the other package, so it doesn't get inserted into the store.
  4. foo-1.0.0-066d344ef7a60d67e85c627a84ba01c14634bea93a414fc9e062cd2932ef35df is instantiated from the Merkle tree in the packages directory. The $out/bin/foo from the other package gets reused here because of the hash collision from earlier, and this is a problem because it references paths in foo-1.0.0-fd53fe2392dc260e9cf414a39aeb43641c10ab48a726c58e76d06a7fe443d660 instead of our own installation directory. This is not what we want.

Please correct me if I'm missing any important details here, @wmertens! But if my understanding is correct, I think that patching self-references at install-time would be the only feasible way to go.

ebkalderon commented 3 years ago

As an aside, I also wonder if blobs containing self-references could be compressed into indexed pack files to reduce the cost of their storage overhead, similar to Git? I expect there won't be any significant wins when compressing binary files, but certainly plaintext formats could shrink nicely. And considering these objects would only be referenced occasionally during installation and perhaps garbage collection, it might be an okay tradeoff. Will think about this a bit more, but it might be out of scope for this issue.

ebkalderon commented 3 years ago

An implementation of serializing external directories as packages has landed in 920ae64! :tada: It uses the post-install self-reference patching strategy, and I have confirmed locally that it does not result in the aforementioned hash collision scenario that the store-time patching strategy would presumably be vulnerable to.

However, the precise cost of the additional storage overhead of self-referencing blobs hasn't been evaluated yet. I will avoid closing this issue until I have collected some real-world metrics on the proportion of files with self-references vs. without in large realistic package sets, so we can determine whether the extra storage overhead will be tolerable in the real world or not. Depending on the findings, this implementation might still get refactored/scrapped later in favor of another approach.

EDIT: Revised the original commit due to some doc comment typos.

wmertens commented 3 years ago

Hmm, I began an answer and lost it.

Binaries still compress nicely, so putting them in pack files should be good.

The hash collision scenario you describe is fine if you use the CAS hash instead of $out. foo-1.0.0-fd53fe2392dc260e9cf414a39aeb43641c10ab48a726c58e76d06a7fe443d660 and foo-1.0.0-066d344ef7a60d67e85c627a84ba01c14634bea93a414fc9e062cd2932ef35df would both describe the same package foo-1.0.0-bea93a414fc9e062cd2932ef35df066d344ef7a60d67e85c627a84ba01c14634 and would only be stored once. Of course, you then have to keep track of the $out -> $cas mapping, which you could perhaps do with symlinks?

ebkalderon commented 3 years ago

No, I am using the CAS hash in the aforementioned example. However, your suggestion made me realize that the crux of my example does not hold. The entries in the packages directory are already derived from the CAS hash of the .pkg tree objects, which means the described scenario, where two packages otherwise identical on disk exist under different hashes, is impossible. Provided that every blob is hashed modulo its self-references, there should be no issue.

The secondary concern of not being able to cryptographically verify or reconstruct the Merkle tree without an external table of self-references on hand hasn't yet been addressed, though. There could potentially be a way to incorporate this information as yet another object type in the Merkle tree, or possibly as a special resource distributed separately alongside the tree itself, e.g. refs as seen in Git. I'll have to investigate further.

wmertens commented 3 years ago

I've discussed this project with @Ericson2314 and he had the idea of a FUSE filesystem that serves files directly from the Merkle store. It would be a bit slower than a real filesystem, but that's a matter of optimization. This would remove the duplication on small systems.

For maximum deduplication, how about adding an extra object type, file metadata, which stores store references and their offsets?
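
Such a metadata object could be as small as the following sketch (entirely hypothetical); keeping it in the Merkle tree itself would also address the earlier concern about needing an external table of self-references to verify the tree:

```rust
use std::collections::BTreeMap;

type ObjectId = [u8; 32]; // hypothetical content hash, as in earlier sketches

/// Hypothetical extra object type: per-file reference metadata. For each
/// store object a file refers to, it records the byte offsets where that
/// reference occurs, so install-time patching needs no linear scan and the
/// tree remains reconstructible without any external table.
struct FileMetadata {
    /// Blob this metadata describes.
    blob: ObjectId,
    /// Referenced object -> byte offsets of each occurrence in the blob.
    references: BTreeMap<ObjectId, Vec<u64>>,
}
```
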

I'm excited by these ideas. This would solve so many things, and would make per-file store fetching possible. To install a package, check if the binary cache has the blob, and recursively ask for dependencies and files you don't have yet. Absolutely minimal installs.

wmertens commented 3 years ago

I'm also wondering how to keep the store network-shareable. If you do a single file per CA blob, it's trivial, and installs would consist of hard-linking regular files directly, with "realized" files kept in a separate directory.

(BTW, in the readme you have the store subdivided by the first byte; this is actually less performant on modern filesystems. Case in point: the Nix store hard-linking uses a single directory with all the CA files.)

Another option would be to use the git store; I wonder if it can be used by multiple writers at the same time. It has well over a decade of production experience with blob storage.

wmertens commented 3 years ago

:thinking: actually, since directories are blobs, all the metadata can be stored in there, right?

wmertens commented 3 years ago

I wrote up how to put the Nix store in a git repo: https://docs.google.com/document/d/1kfnu0y_x8spEO4JgynDE4Gp09MiCyAZ67kXQNGsMSpE/edit?usp=sharing

wmertens commented 3 years ago

I moved the writeup to https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871