bazelbuild / remote-apis


V3 idea: Loosen the restriction of action input as Merkle tree #141

Open moroten opened 4 years ago

moroten commented 4 years ago

I have observed that Bazel can spend a lot of CPU resources calculating Merkle tree digests. This has been discussed in https://github.com/bazelbuild/bazel/issues/10875 and in "Extend the Action Cache with alias digests".

The key point is that the single input Merkle tree only needs to be resolved on a cache miss, which should be rare, so the client should be allowed to check for a cache hit using something else.

One idea was to create an alias cache entry whose digest the client would be able to calculate in any suitable way. The problem is that the alias would have to be uploaded by the clients - a trusted CI machine or an untrusted developer machine - but not by the remote execution server side. Therefore, using action cache aliases makes the system vulnerable to cache poisoning.

Instead, @EricBurnett suggests loosening the restriction on the input so that it can describe partial trees:

https://github.com/bazelbuild/remote-apis/issues/140#issuecomment-636983411 For merkle trees as inputs, the general properties we care about are:

  • Recursively defined, so that sharing trees in inputs doesn't require operating on a whole tree each time
  • Uploadable in parallel, so that it doesn't add unnecessary round trips on the order of the depth of the tree.

https://groups.google.com/forum/#!msg/remote-execution-apis/F0Qb4m0J4Vg/QANi1BMdAgAJ I will note that Merkle Trees, when used as inputs, are defined as they are to achieve:

  1. Reusability (sub-trees shared by two actions will share Merkle Tree nodes),
  2. Determinism (the same set of inputs will always get the same tree, regardless of client)

What would be a good design?

  1. Extend the Directory message to allow additional roots, not just subdirectories?
  2. Let Action.input_root_digest be repeated?

Any other design ideas or any ideas to solve the problem in a totally different way?

aiuto commented 4 years ago

cc: @meisterT @ulfjack

EricBurnett commented 4 years ago

Thanks for filing this, Fredrik!

Any other design ideas or any ideas to solve the problem in a totally different way?

We could also allow the DirectoryNode messages within a Directory to overlap - essentially allowing aliasing at any part of the tree, not just the root. I'm not sure whether that's actually more useful than repeating input_root_digest, but it has the benefit of not needing proto field changes, only a relaxation of which trees a server chooses to accept.

// (Directory proto)
{
  directories: [
    {
      name: "src",
      digest: {
        hash: "4cf2eda940...",
        size: 43
      }
    },
    {
      name: "src",
      digest: {
        hash: "ab1242eeaf...",
        size: 42
      }
    }
  ]
}

I've been pondering this off and on for the last few weeks as well, and overall I still like the sound of it, but think it'll need prototyping to demonstrate it would actually be a big performance win. E.g. for bazel, this would probably work well if it ultimately produces a merkle tree with a few overlapping nodes at the root, but would probably break down if it produced a merkle tree with O(deps) overlapping nodes at the root, each of which is a linear chain of nodes down to one or two files - at that point the root node would be better off being a linear list of files.

It may also be necessary to allow skipping levels, if the dominant cost is actually rebuilding N levels of merkle tree nodes for each action, e.g.

// (Directory proto)
{
  directories: [
    {
      name: "src",
      digest: {
        hash: "4cf2eda940...",
        size: 43
      }
    },
    {
      name: "src/lib/foo/bar/baz",
      digest: {
        hash: "ab1242eeaf...",
        size: 42
      }
    }
  ]
}

moroten commented 4 years ago

Nice and simple suggestion, Eric!

A simple implementation (here as runnable Python; the `cas` client object and `input_root_digest` are assumed to be available) would be:

from hashlib import sha256
from pathlib import PurePosixPath

files = {}                   # path -> Digest (the SortedMap<string, Digest>)
visited_directories = set()  # digests already expanded (the HashSet<Digest>)

def visit_directory(root_path, root_digest):
    if root_digest in visited_directories:
        return
    visited_directories.add(root_digest)
    directory_message = cas.get_directory(root_digest)  # fetch the Directory proto from the CAS
    for file_node in directory_message.files:
        path = root_path / file_node.name
        # Overlapping subtrees may mention the same path twice; that is only
        # legal if both occurrences agree on the digest.
        if files.setdefault(path, file_node.digest) != file_node.digest:
            raise ValueError(f"conflicting digests for {path}")
    for dir_node in directory_message.directories:
        # pathlib handles multi-segment names like "src/lib/foo/bar/baz", and
        # an absolute name like "/src/lib" reaches back to the root.
        visit_directory(root_path / dir_node.name, dir_node.digest)

visit_directory(PurePosixPath("/"), input_root_digest)

# Hash the flattened, sorted file list; no Merkle tree needs to be rebuilt.
deterministic_input_digest = sha256("\n".join(
    f"{path} {digest.hash} {digest.size_bytes}"
    for path, digest in sorted(files.items())).encode()).hexdigest()

Two observations:

  1. It is up to the remote cache implementation how to hash the input tree; it doesn't matter to the client. There is no need to build an actual Merkle tree - one can simply hash the flattened list of files, symlinks, and (empty) directories.
  2. This implementation would also support an entry like name: "/src/lib" that reaches back to the root again. Using .. could be considered as well, but I'm not familiar with the potential symlink complexity that I hear about every now and then. This should make the encoding efficient in terms of the minimal number of required nodes.

I've been pondering this off and on for the last few weeks as well, and overall I still like the sound of it, but think it'll need prototyping to demonstrate it would actually be a big performance win.

Your suggestion still allows using the v2 proto for transport, so it should be fairly simple to add --experimental flags to Bazel.

Regarding the performance win, my early alias implementation saves a lot of time: https://github.com/bazelbuild/bazel/issues/10875#issuecomment-604736140. I also added an in-memory action cache and an in-memory CAS to Bazel, which removes most of the cache handling time:

https://github.com/bazelbuild/bazel/issues/10875#issuecomment-617410032 Using the fast cache based on depsets, the total execution phase duration shrank [from 50.24] to 35.26 s. ActionContinuation.execute took 7.38 s of which 0.961 s is the fast hashing. postprocessing.run took 3.27 s.

EdSchouten commented 3 years ago

Just out of curiosity, is this a consequence of Bazel not 'pushing down' the concept of a tree deep enough? I guess that if Bazel didn't provide us the inputs as a SortedMap<PathFragment, ActionInput>, but as a tree instead (where commonly reused subtrees are shared with other actions), it would have been easier to reduce the amount of hashing we do.

EricBurnett commented 3 years ago

@EdSchouten one benefit comes down to reducing tree merging - for any given action, bazel today has to merge the dependency trees of its direct dependencies, which will be partially but not completely overlapping. So it has to walk the ~full trees and recalculate a bunch of hashes for every single action, vs just re-hashing, say, one or two low-level nodes. So we're exploring ways to reduce that cost.

As a rough analogy, you can think of this like the difference between adding items to a heap one by one (O(n log n)) and collecting all the items and then heapifying in one shot (O(n)); see the sketch below. In the case of bazel, executors merging a whole bunch of mostly-redundant trees at once at execution time should be cheaper than bazel having to update a new canonical tree for each dependency in turn. And then on top of this, both moving the tree-merging behind the cache lookup (skipped in 90%+ of cases) and into a remote system (distributed, not bottlenecked on the builder's resources) should make it an easy win on net - deferring the merge costs far enough that they mostly don't get paid at all, reducing the net cost when they are paid, and distributing that work.
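
To make the heap analogy concrete, here is a minimal sketch of the two strategies in plain Python (standard library only; not from the thread):

from heapq import heapify, heappush

items = list(range(1_000_000, 0, -1))

# One by one: n inserts at O(log n) each, O(n log n) total - analogous to
# re-canonicalizing the input tree after each dependency is merged in.
heap_incremental = []
for item in items:
    heappush(heap_incremental, item)

# All at once: heapify builds the heap in O(n) - analogous to merging all
# the mostly-redundant trees in one shot at execution time.
heap_batched = list(items)
heapify(heap_batched)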

I'm not sure what the exact optimal algorithm is, since you wouldn't want this to degenerate to "list everything linearly in a single root node", but e.g. "wait until you have N (10? 100?) redundant entries and then merge them together down one level" would presumably be much cheaper for bazel than full canonicalization, while keeping ~all the benefits of a recursive merkle tree in practice (see the sketch below). @moroten also has a prototype in this bug that showed significant wins, but I didn't fully grok how it was structuring inputs so I'll leave it to him to describe :). Possibly it's more closely aligned with the existing nested sets in bazel, however they're structured today?
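
A hypothetical sketch of that threshold-based merging; every name here (LazyDirectory, CHILD_LIMIT, merge_and_rehash) is illustrative, not taken from any real implementation:

from hashlib import sha256

CHILD_LIMIT = 10  # the "N (10? 100?)" threshold from the comment above

def merge_and_rehash(entries):
    # Stand-in for "canonicalize this group into one Directory and digest
    # it"; a real implementation would build and hash a Directory proto.
    return sha256(repr(sorted(entries)).encode()).hexdigest()

class LazyDirectory:
    """Accumulates possibly-overlapping child entries, and only
    canonicalizes one level down once there are too many of them."""

    def __init__(self):
        self.children = []  # (name, digest) pairs, overlaps allowed

    def add_subtree(self, name, digest):
        self.children.append((name, digest))
        if len(self.children) > CHILD_LIMIT:
            self._merge_one_level()

    def _merge_one_level(self):
        # Group entries by their first path segment and merge each group
        # into a single canonical child, re-hashing only those groups.
        groups = {}
        for name, digest in self.children:
            head, _, rest = name.partition("/")
            groups.setdefault(head, []).append((rest, digest))
        self.children = [(head, merge_and_rehash(group))
                         for head, group in groups.items()]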

moroten commented 3 years ago

The implementation I've done in Bazel, https://github.com/moroten/bazel/blob/c01401f38ec9ed7122a53ee9d4cc673b30ba590b/src/main/java/com/google/devtools/build/lib/remote/FastInputKeyCache.java#L51-L91 (Add fast cache check for remote execution) on https://github.com/moroten/bazel/tree/optimize-remote-execution, views a depset as a directed acyclic graph and memoizes the digest of each depset globally. The same input file might exist in multiple copies in the DAG at this stage, but that's okay. (The depset implementation has recently changed so that its internals can no longer be accessed, so there is no clean way to do this at the moment.)
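
A rough sketch of that memoization, assuming each depset node exposes its direct (path, digest) entries and its child depsets (the Depset shape and field names here are illustrative, not Bazel's actual API):

from hashlib import sha256

depset_digests = {}  # id(depset node) -> hex digest, memoized globally

def depset_digest(depset):
    """Digest of a depset node: its direct entries plus the memoized digests
    of its children. The same file may be reachable through several nodes,
    but the resulting digest is deterministic, so that's okay."""
    key = id(depset)
    if key not in depset_digests:
        h = sha256()
        for path, file_digest in sorted(depset.direct):
            h.update(f"F {path} {file_digest}\n".encode())
        for child in depset.transitive:
            h.update(f"D {depset_digest(child)}\n".encode())
        depset_digests[key] = h.hexdigest()
    return depset_digests[key]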

Bazel then stores the "depset digest to merkle tree" mapping globally. For now, I hijack the action cache by using the input tree digest and stdout fields. This creates an extra lookup round trip, but saves computational time on the Bazel client side. It would be better to have built-in support for this mechanism in the REAPI instead, basically sending the depsets.
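
As I read that description, the lookup flow is roughly the following (schematic only: the real prototype is in the FastInputKeyCache.java linked above, and reusing the stdout field is an interpretation of "input tree digest and stdout fields"):

def lookup_merkle_root(action_cache, fast_depset_digest):
    # One extra ActionCache round trip, keyed by the fast depset digest; the
    # entry's stdout field is repurposed to carry the real input root digest.
    result = action_cache.get_action_result(fast_depset_digest)
    if result is not None:
        return result.stdout_digest  # repurposed: the Merkle tree root
    return None  # miss: build the Merkle tree the slow way, store the alias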

In the Bazel case, partitioning the input tree according to depsets will create a simple and reasonable tree structure; it will more or less follow the build rule partitioning. There must be similar data structures in other clients as well to handle transitive input files.

HALtheWise commented 3 years ago

Sorry for jumping late into this conversation, but I want to propose an alternate hashing scheme for file sets that I think addresses many of the concerns here. Basically, define the hash of a set of files to be the XOR of the hashes of the (filepath, contents) tuples, one for each file in the tree. This is really cheap to merge for any non-overlapping sets of files (even if they're intermixed in various subdirectories), and adding a file is now constant time rather than linear in the depth of the tree. Interestingly, removing a set of files is also a cheap operation, as long as you know that all the files were previously in the set.
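
A minimal sketch of the XOR scheme (illustrative; as discussed further down, a production version would fold file metadata into the per-file hash too):

from hashlib import sha256

def file_hash(filepath, contents):
    # Hash the (filepath, contents) tuple; the path is length-prefixed so
    # that bytes cannot shift between the path and the contents.
    h = sha256(f"{len(filepath)}:{filepath}".encode())
    h.update(contents)
    return int.from_bytes(h.digest(), "big")

def set_hash(files):
    """Hash of a file set: the XOR of the per-file hashes. Order-independent,
    and adding one file is O(1) instead of O(depth of the tree)."""
    acc = 0
    for filepath, contents in files:
        acc ^= file_hash(filepath, contents)
    return acc

# Merging disjoint sets is one XOR on the hashes, and removing files known
# to be present is the same XOR again.
a = set_hash([("src/main.c", b"int main() {}")])
b = set_hash([("src/util.c", b"/* util */")])
merged = a ^ b
assert merged ^ b == a  # removal undoes the merge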

For a client like Bazel working with depsets, there are interesting options available where you cache hashes for each depset, then use graph-walking operations during a merge to figure out which graph nodes are duplicated on both sides of the merge, and simply subtract them back off.
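
Continuing the sketch above: elements present in both sets cancel under XOR, so XOR-ing two cached depset hashes yields the hash of their symmetric difference, and the duplicated nodes found by the graph walk just get XOR-ed back in once:

def union_hash(hash_a, hash_b, shared_hash):
    # hash(A) ^ hash(B) == hash((A | B) - (A & B)), since shared elements
    # cancel; XOR-ing the shared nodes' hash back in gives hash(A | B).
    return hash_a ^ hash_b ^ shared_hash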

moroten commented 3 years ago

@HALtheWise Interesting XOR hash suggestion! When checking the cache, only the root hash is needed, which could benefit from the faster calculation. One could still keep the requirement of uploading the Merkle tree structure when executing an action, like in v2. (My goal has been to optimize the case of a 99.99% cache hit rate.)

EricBurnett commented 3 years ago

Ooh, that's an interesting direction! I could definitely see it as a path to cheaper cache hits.

Relatedly, how would this work for trees that embed any data (is_executable, node_properties, etc.)? In most cases I'd expect the set of paths to be unique enough, but that's not guaranteed by the spec, and I can certainly imagine cases where there'd be collisions. (E.g. forgetting to set the executable bit -> correcting and re-running the same action.)

I think (filepath, contents) is not quite sufficient as a result, though I could imagine replacing it with a struct holding all the necessary metadata of a file. At which point I guess the hypothesis is that it's easier for build tools to keep track of, or update, mutually non-intersecting sets of files+metadata relative to some common root than it is for them to keep track of recursively subdivided file sets of the same? I'll admit I don't know bazel's fileset use well enough to quite understand that - I could see it in the simple case of "add the outputs of action X to this prebuilt fileset", but I'm not sure how it works for "unify the visible files from these N subtrees", as I assume it works for merging the inputs to an action.
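
For illustration, the per-file hash could cover a small metadata struct instead of just (filepath, contents) - a sketch, assuming path, content digest, executable bit, and node properties are the fields that matter:

from hashlib import sha256

def file_leaf_hash(path, content_digest, is_executable, node_properties):
    """Per-file hash that includes metadata, so that e.g. flipping the
    executable bit yields a different set hash instead of colliding with
    the old one."""
    h = sha256()
    fields = (path, content_digest, str(is_executable),
              repr(sorted(node_properties)))
    for field in fields:
        h.update(f"{len(field)}:{field}".encode())  # length-prefixed fields
    return int.from_bytes(h.digest(), "big")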