bazelbuild / remote-apis

An API for caching and execution of actions on a remote system.
Apache License 2.0
343 stars 119 forks source link

REv3 idea: CAS.ExpandTree()? #140

Open EdSchouten opened 4 years ago

EdSchouten commented 4 years ago

Build systems like Bazel currently treat directory outputs as being opaque. There is no support for addressing individual files contained within.

If Bazel wants to pass on a directory output from one action to another, it currently needs to download a Tree object from the CAS and expand it into individual Directory objects. These then need to be reuploaded into the CAS.

This roundtrip could be avoided by adding a dedicated RPC of the shape:

rpc ExpandTree(Digest) returns (Digest);

In this case, the argument corresponds to the digest of the Tree object. The return value corresponds to the digest of the root directory contained within the Tree object. The call should also bump the TTL on all resulting CAS objects.

Do we consider this to be enough of a problem that it makes sense to add such an RPC?

ulfjack commented 4 years ago

That seems to be predicated on the assumption that Bazel is not going to allow addressing individual files in the future. That assumption does not seem correct to me. I'm not sure if there are features in Bazel around this right now, but it seems like a perfectly safe bet that there will be, even if Bazel team isn't currently planning to add them.

The reason for this is simple: adding unnecessary input files to an action is bad for build performance (reduces caching & increases action costs). This is particularly true the larger the output directory tree is, i.e., the more you're trying to save network bandwidth with this feature, the worse performance gets. Therefore, I would bet on Bazel adding features to allow addressing individual files or filtering files or something else along these lines that requires Bazel to have a local list of files.

EdSchouten commented 4 years ago

That seems to be predicated on the assumption that Bazel is not going to allow addressing individual files in the future.

No, that's not correct. It is predicated around the fact that in many cases directories declared using declare_directory() are simply passed on in literal form. If Bazel actually needs a subset of a Tree (which would likely only be used in certain cases), it is still capable of downloading the Tree object and carving it up as needed.

Even in the case where a Tree needs to be carved up, it might still be preferable to call ExpandTree() to at least store all subtrees that end up being used in literal form.

ulfjack commented 4 years ago

My position is that rules will (should?) ~always use a subset of larger trees in order to avoid the performance costs. Maybe that's overly optimistic, but even so, if that's the recommended best practice, why spend significant effort on optimizing the other case?

If the tree needs to be carved up, why would you have ExpandTree as a separate API compared to requiring action execution to generate / store all the subtrees?

EdSchouten commented 4 years ago

So you're proposing that GetActionResult() is made slightly stronger, in that it only returns results if all directories stored in a Tree referenced by the ActionResult are also stored in the CAS as separate Directory objects? That sounds reasonable.

peterebden commented 4 years ago

I've thought about suggesting something along these lines for a v3 proposal before, or at least sketching out the general problem of the asymmetry between the messages describing outputs & inputs and the work required to convert one to the other when another action wants to consume them.

So generally in favour, but would also question whether there is something we could do to make the two more similar (although that might be too big a change even for v3?).

mostynb commented 4 years ago

So you're proposing that GetActionResult() is made slightly stronger, in that it only returns results if all directories stored in a Tree referenced by the ActionResult are also stored in the CAS as separate Directory objects? That sounds reasonable.

I assumed that this was already required, but looking at the GetActionResult() comment again now it is slightly ambiguous.

EricBurnett commented 4 years ago

For a bit of background, the Tree handling in the API is a known weak spot

For merkle trees as inputs, the general properties we care about are:

Of course, this runs into problems for output trees - you would not want to expand a recursively-defined merkle tree by making an RPC for each level one-by-one - that'd be terribly performing.

So we started by adding GetTree to do a server-side expansion, then realized that for most use-cases we don't actually need the CAS to have special knowledge of trees at all. Interpretation of protos can be left in the Execution API, with the CAS operating on blobs only. Which led to the inclusion of OutputDirectory as a "flattened tree" - a separately serialized proto so that we don't have to worry about a whole tree fitting into one fixed-size RPC response, but still flattened under the assumption that clients will have to interact with the whole tree at some point.

But I see your point here - even with all directory nodes in the CAS individually (which is my interpretation of the comments as well, but slightly ambiguous as you say), you still have to download the serialized tree just to get to the root directory node, if that's all you want.

To that end, we might instead consider updating OutputDirectory (Edit: mistakenly said DirectoryNode) to have both the digest of the tree (for access to the flattened form) and the digest of the directory node itself (for reuse as an input)? Then it'd be clear that if the server gives you a directory root digest back, all nodes must be in the CAS too by definition, and you can reuse it as-is. This could be optional in V2; upgraded to Required in V3?

(I'd also love to get rid of GetTree entirely; can't remember offhand why it was added back though and if it still has a purpose).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bazelbuild/remote-apis/issues/140#issuecomment-636940255, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABREW5PJDMMKBI6QWGBTZLRUPF5HANCNFSM4NPPEWOQ . [image: image.gif]

EdSchouten commented 4 years ago

To that end, we might instead consider updating OutputDirectory (Edit: mistakenly said DirectoryNode) to have both the digest of the tree (for access to the flattened form) and the digest of the directory node itself (for reuse as an input)? Then it'd be clear that if the server gives you a directory root digest back, all nodes must be in the CAS too by definition, and you can reuse it as-is. This could be optional in V2; upgraded to Required in V3?

:+1: That sounds like a good idea.

(I'd also love to get rid of GetTree entirely; can't remember offhand why it was added back though and if it still has a purpose).

:+1: to that as well. I can imagine that something like GetTree() makes sense to have on a worker where you might want to gather a full input root. Still, there's no need to let that be part of the build client -> cluster protocol.

juergbi commented 4 years ago

Could it make sense to unify Directory and Tree by extending the DirectoryNode proto to support an embedded Directory message as alternative to the digest? This would allow clients and servers/workers to flatten directory trees where it makes sense without forcing a completely flat directory for large trees (as is currently the case for OutputDirectory). It would still allow reuse of complete Directory messages via digest.

This would provide consistency across all operations and we should no longer need GetTree() or ExpandTree(), i.e., the CAS API wouldn't need any knowledge of the Directory proto. I think this would be a net benefit with regards to REAPI complexity overall.

A possible disadvantage is that the flexibility would no longer provide a single canonical representation of a directory hierarchy. However, I'm not sure whether this would be a big issue in practice as long as individual clients use deterministic rules to decide embedding of subdirectories.

This would also address #165 and #170.

Any thoughts? Are there significant disadvantages I'm overlooking?

illicitonion commented 4 years ago

I like it - the canonical digest of a Directory could still be as it is today, with implementations which inline subdirectories for transmission just needing to pay the extra cost of serialising them differently for digesting than for transmission.

EdSchouten commented 4 years ago

Are there significant disadvantages I'm overlooking?

The downside of such an approach is that there will no longer be a unique representation of a directory hierarchy. You can construct multiple identical directory layouts, but encode them differently:

It sacrifices the idea that a node in the Merkle tree of given contents has a unique digest. Couldn't this lead to lower AC hit rates if not all of the tools are in full agreement when to (un)pack directories in input roots?

Another issue with this approach is that the decision when to (un)pack is made unilaterally. The worker decides whether everything needs to be stored separately or together, whereas the client may have a different opinion on what would have been smart, looking at the build overall. For example, when Builds without the Bytes is turned off, it may be more preferable to receive a Tree-like object. When Builds without the Bytes is turned on, a plain Directory would have been sufficed (and be preferable).

This would provide consistency across all operations and we should no longer need GetTree() or ExpandTree(), i.e., the CAS API wouldn't need any knowledge of the Directory proto.

I would like us to go in a different direction here. If we ever want to support encryption (#133), it already makes sense to decompose CAS objects into two parts:

In such a model it would make a lot of sense to always store objects separate; never use something like Tree. The GetTree() call could be replaced by a generic GetTransitiveClosure() call to expand a full tree of CAS objects.

illicitonion commented 4 years ago

Are there significant disadvantages I'm overlooking?

The downside of such an approach is that there will no longer be a unique representation of a directory hierarchy. You can construct multiple identical directory layouts, but encode them differently:

  • Either pack all Directory objects together,
  • Store them all separately,
  • Have some directories packed, while others are stored separately.

Couldn't this lead to lower AC hit rates if not all of the tools are in full agreement when to (un)pack directories in input roots?

I think this is fine as long as the digest algorithm is well specified. If there's a canonical form for encoding them for digest, and effectively we're just hiding hint information in a separate un-digested field, these worries go away. (Of course, some new worries around correctness of implementation now appear!)

EdSchouten commented 4 years ago

I think this is fine as long as the digest algorithm is well specified. If there's a canonical form for encoding them for digest, and effectively we're just hiding hint information in a separate un-digested field, these worries go away. (Of course, some new worries around correctness of implementation now appear!)

Let's assume there is a canonical form for digest. Now let's assume a worker gives us an output directory of which we don't know whether it is in canonical form. Wouldn't that require the client to download the full output directory hierarchy, so that it may be canonicalised by the client?

The entire goal of this PR was to introduce a mechanism where clients do not need to download full directory output hierarchies.

EricBurnett commented 4 years ago

The downside of such an approach is that there will no longer be a unique representation of a directory hierarchy. You can construct multiple identical directory layouts, but encode them differently:

I'd like to break this anyways, personally - see e.g. https://github.com/bazelbuild/remote-apis/issues/141#issuecomment-638865999 . Fundamentally, what I've realized we actually need is approximately "For clients producing overlapping directory trees, they can easily get high cache hit rates" . But that's more about clients being internally-deterministic and well behaved - I don't think we actually see ~any cross-tool caching, and I don't think it actually matters. So we should optimize for e.g. allowing Bazel to produce trees in a way efficient for it to compose, and reasonably compact for expanders. And we should allow BuildBarn to do the same. But we don't really benefit from those trees being the same as far as I can tell, and I don't think most consumers of trees care if they're in canonical form or not? At most, they might care that the same tree from the same source is deterministic, to e.g. easily say "this output tree has not changed". But that's a lot more limited than what we enforce now, and requiring a fully-expanded non-overlapping tree seems to be holding us back in a lot of contexts.

To Daniel's point, there is still a canonical form that can optionally be used where needed. But I expect that's actually the exception and not the norm.

ulfjack commented 4 years ago

I'm generally in favor of a more lenient approach and also in favor of unifying the two tree representations.

I wonder how difficult it would be for a RE system to detect cases where a client sends the same tree in different representations. Such a mechanism could be used to detect misbehaving clients.

EricBurnett commented 4 years ago

I wonder how difficult it would be for a RE system to detect cases where a client sends the same tree in different representations. Such a mechanism could be used to detect misbehaving clients.

Servers could canonicalize a sampling of trees, and report on rates of collisions? (E.g. X client produced tree Y which canonicalized to Z; different from previously seen tree Y' from client X that also canonicalized to Z). Might be a bit hard to expose in an actionable way, but shouldn't be too bad I don't think as the server could reasonably track the invocation ID / action ID from this sampling and be able to point directly back to examples where it was concretely observed. But one thing we'd have to learn observationally is what a "reasonable" rate of collisions is.