alercah commented 4 years ago

I was speaking to @edef1c this past weekend about some of her work on Nix, and we ended up talking a bit about the REAPI and the differences between Nix's build API and REAPI. She was interested in REAPI, but one feature that Nix has that this API does not is the ability to queue entire sequences of actions, rather than having the client manage the build graph entirely on its own.

In particular, REAPI requires that a client wait until it has received the finished response from the server before it will know the digests of the output files. This is a prerequisite to being able to upload the next action, since the input Merkle tree will depend on those digests. The client will in fact have to upload all of those digests before launching the next action. This requires at least two round trips in order to send the digests to the client, receive the new action content upload (which could easily include a number of directory nodes), and then receive the execution request.

Another consequence of this design is that the client doesn't know the hash of an action until it its input digests are known. This also means that in every build, the client must query the action cache for every action whose result it does not already have cached client-side, even if the entire build is already cached and only the final output is needed.

After some chatting, we sketched out a design which I offered to write up and share with the group. My recollection is that Bazel is in theory interested in supporting this functionality as well, but due to architectural limitations it would be a major project. There are basically three features:

First, in the input tree for an action, instead of storing a content digest for a file or directory, the client can instead provide a pair of (action digest, path). This indicates that the file or directory is to be obtained by looking up the digest in the action cache and using the digest of the corresponding path in the output (which need not be listed as an output path; it could be a prefix of one or a path inside an output directory).
- If that action is a failure (nonzero exit code), or the requested output path is missing or incompatible with the input specification (e.g. a directory was present where a file was requested), then the outer action also fails, with the action result encoding a list of failed transitive dependencies. This allows a client to easily query those actions' logs.

This change allows actions to be specified whose inputs are outputs of actions whose results are not yet known by the client, thereby allowing action hashes to be computed from the original source files and the build graph. Actions encoded in this way can be checked against the action cache without knowing any intermediate output hashes, if the client is only concerned with the ultimate output.

Assuming that actions are properly reproducible and always produce the same output, this preserves the relationship that actions with identical hashes must have identical input file content and directory layout, by induction on the structure of the build graph. Nondeterminism in an action will be passed on to dependents, however. Currently, if action A is fully deterministic and depends on action B, and B is nondeterministic, A will always produce the same result since its hash encodes the specific files used from B. But if this approach to inputs is used, the input to A will change if the entry for B in the cache is updated. This could result in the output of A changing.

It might be desirable, to speed retrieval of intermediate results, to also add an API to get a file by (action digest, path) pair, so that the client does not need to download intermediate results.

Second, if supported by the server, an Execute request can be made where content needed is missing from the CAS or from the action cache. This could include the action digest itself, command, platform, or any action or content digest in the input tree.
- When the server receives an Execute request, it first checks the cache if applicable and returns a result as usual if there is a hit. If there is not, and if any necessary content is missing from the CAS or action cache, the server puts the operation into a "pending" state. It is not queued for execution; instead, the server returns a list of the missing digests to the client. If an execution request for an action is pending or queued, then the action digest is not missing, but the operation is still pended.
- An operation in the pending state will time out after a relatively short period if no further action takes place, but it will not time out if operations exist for all its dependent actions and uploads are in progress for all their dependent content. Time spent pending does not contribute to the execution timeout.
- When missing content digests are uploaded or when a request to execute an action is received, the server updates the list of missing digests on the pending operation as appropriate. This may result in the addition of new missing digests, such as if a directory node is made available and its children nodes are missing from the CAS.
- When a dependent action fails, the pending operation fails due to a failed dependency, as above.
- When all missing content is uploaded and all dependent actions are complete, the operation is queued for execution.
  - If a dependent action is uncacheable, then its output will be used for all pending actions when it completes, but not kept beyond that.
- The server should strongly consider supporting cancellation of pending operations.

The purpose of this pending functionality is fundamentally to cut down on round trips to the server in order to launch actions. The core functionality is that, by using inputs that refer to the outputs of other actions as proposed in the first feature, the client can pend an operation while its dependent operations are in progress, and then the server can automatically launch it as soon as possible, removing round trips from the critical path.

A particular feature about this design is that the server will accept an action that is still missing components to be uploaded to the CAS, or depends on actions that are not yet pending. It would certainly be simpler to require that all CAS blobs be present and that all dependent actions be pending or queued before accepting an action, and this would still eliminate the round-trips.

However, with action hashes becoming computable without needing to know the intermediate outputs, a client that does not care about intermediate results, and that thinks cache hits are likely, will generally want to check the cache for the final output first, and work its way back up towards the leaves, rather than starting from the leaves and working its way to the root. As soon as it finds a cache hit, it can skip processing that entire subtree. It would be relatively natural for the client to start the process of requesting executions from the root as it does this traversal. Requesting execution from the leaves would require a second traversal, possibly interleaved with the first for even more complexity. This requires, however, that the server is willing to accept actions whose dependencies are not yet received for execution.

Without accepting missing content as well, however, the client must upload all inputs before it can request an execution. Uploading content root-first is an unappealing choice, as it's highly speculative on the success of the earlier actions. Where bandwidth is limited, this may significantly delay the launching of the first actions on the build (even checking the cache root-first delays the first actions, but it trades off with faster cache hits on larger subtrees; there's no tradeoff with input content).

@edef1c had proposed allowing references within the stream in order to cut down on bandwidth costs, e.g. "action 2's input is the output to action 1", but this would require a lot of work to handle actions stored in the CAS, is stateful, and would also require a complex forward-referencing system if the client wishes to traverse from the root.

This more general design has three other benefits: First, we can cut out additional round trips by having the execute request also act as a cache check, since it now can succeed regardless of the CAS state. This benefit can be enjoyed by all executions, even without pipelining actions. Second, without allowing missing inputs, uncacheable actions would not be pipelineable, nor could they even have their outputs referred to by hash. Pipelining could perhaps work, but the client would be racing to upload the dependent action before the first action completes. This way, however, the client can upload the dependent action first so it will definitely be pending before the dependency completes. Another alternative, which may nonetheless be worthwhile to handle edge cases or when the client wishes to traverse in the opposite order, could be to allow a client to upload an uncacheable action result solely for the purpose of providing inputs to pending actions, with the server discarding it immediately afterward.

Third, many pending executions give the scheduler more information about the build graph. It could, for instance, prioritize jobs deeper in the build graph with a view to reducing the critical path time. Currently only the client has this information, and only limited tools to coordinate prioritization with the server.

Third, a new autolaunch flag is added to the ExecuteRequest. If set, then while the operation is pending, if a dependent action is available in the CAS, and it is not already executing, the server automatically creates a new operation for that action with the same execution parameters as the outer one. The ExecuteOperationMetadata will list all of the dependent operations being executed, so the client can query their status directly. The metadata's list of missing digests will include the lists of transitive dependencies.

This feels like more of an optional addon, but it would allow the client to work on uploading and queueing inputs in the context of a single controlling ExecuteRequest. The client would then not need to query dependent operations directly.

The main drawback to this approach is that the list of transitive missing digests could become quite large, and sending an entire copy of it every time there is a change could become a massive waste of bandwidth. Changing to a semi-stateful model, where the client is given an initial list when it first requests execution or reconnects with WaitOperation, but thereafter receives only incremental updates, might improve communication significantly.

A postscript

155 proposes adding a dynamic content-fetching system so that the client can upload inputs it didn't know were needed mid-flight. We had all the discussion leading to the above design sketch before I poked my head over here and discovered the remarkably coincidental timing.

Given that it also proposes an incremental, chatty protocol for the client which can include requests for additional data, it seems to me that these two proposals dovetail nicely together. I don't see any obvious incompatibilities, but I do think that they should be carefully designed together.

sstriker commented 4 years ago

Thanks!

I think this would be an additional proposal to the ones that made it to the list last year: https://groups.google.com/g/remote-execution-apis/c/MbT8TCxlOss/m/gYlHs4Y1AAAJ https://groups.google.com/g/remote-execution-apis/c/x_kKPl50Kt8/m/MxOnXoZ5BQAJ

It would be great to see if we can think of ways to do this in a v2 compatible fashion or if not, have this as a v3 idea. Would you mind reviving the thread on list?

alercah commented 4 years ago

I'd prefer to minimize my participation in the active discussion, to the extent that I can.

On Fri, 14 Aug 2020 at 15:53, Sander Striker notifications@github.com wrote:

Thanks!

I think this would be an additional proposal to the ones that made it to the list last year:

https://groups.google.com/g/remote-execution-apis/c/MbT8TCxlOss/m/gYlHs4Y1AAAJ

https://groups.google.com/g/remote-execution-apis/c/x_kKPl50Kt8/m/MxOnXoZ5BQAJ

It would be great to see if we can think of ways to do this in a v2 compatible fashion or if not, have this as a v3 idea. Would you mind reviving the thread on list?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bazelbuild/remote-apis/issues/157#issuecomment-674246762, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE7AOVN57GHU2ORBLOWGSCDSAWI5BANCNFSM4PJ665KA .

ulfjack commented 4 years ago

It is possible to do this in a way that is backwards-compatible with v2. It is already possible to attach arbitrary key-value pairs to inputs in the merkel tree and these can be used to attach the dependent action digest and path. It would have to be gated on a capability returned by the server (i.e., clients must only make such request after verifying that the server supports them).

bazelbuild / remote-apis

Action pipelinining and automatic queuing #157

155 proposes adding a dynamic content-fetching system so that the client can upload inputs it didn't know were needed mid-flight. We had all the discussion leading to the above design sketch before I poked my head over here and discovered the remarkably coincidental timing.