bazelbuild / remote-apis

An API for caching and execution of actions on a remote system.
Apache License 2.0
337 stars 118 forks source link

Execute/ActionResult Rejection Response #108

Open werkt opened 4 years ago

werkt commented 4 years ago

Executions are capable of depending upon or producing invalid definitions or products, respectively.

Currently, the remote system has limited options for influencing the behavior of the client, exemplified here with bazel. This sequence is very complicated and boils down to:

  1. Any stock RemoteRetrier 'retriable' status codes in the ExecuteResponse status field trigger a waitExecution.
  2. A FAILED_PRECONDITION status in ExecuteResponse with an array of ExecuteResponse->Status->Details[PreconditionFailures] with only "MISSING" types triggers a restart of the execution loop, starting with ensureInputsPresent, if the outer execution retrier has not been exhausted.

If the execution retrier is exhausted, the client will revert to fallback behavior, either local execution or failing.

Currently, a remote instance may identify subsequent requests on the action (pursuant to RequestMetadata) and coordinate a short-circuited response (of more like [2] above).

But this practice is awkward, and depends upon a reasonable retry count and hopefully no exponential backoff for any sub request. Can we provide/standardize a status like GOAWAY that indicates that an action is unsuitable for remote execution or caching?

bergsieker commented 4 years ago

George, can you provide a driving use case for when this would be useful? What does it mean (to you) for an action to be "unsuitable"?

werkt commented 4 years ago

Currently, we reject actions that a) depend on oversized inputs and b) produce oversized outputs. We accomplish this for 'a' by passively ignoring (through findMissingBlobs manipulation) requests until we have a clear channel (the execute Operation response) to present a PRECONDITION_FAILURE or inspire a local fallback in bazel. For 'b' we supply this PRECONDITION_FAILURE as an action result for an oversized output.

Oversized in this case varies by heuristic, but can be thought of as a simple file size: say 1G.

There is future work to systematically apply policies to actions' behaviors, like never accessing above a percentage threshold of their inputs via the filesystem, producing orders of magnitude larger outputs than inputs, executing blacklisted programs, or other abusive behaviors involving resource acquisition - spending a threshold percentage of time blocked on IO, network, or just sleeping, etc.

ulfjack commented 4 years ago

I don't think it's safe to assume that there is an exact 1:1 correspondence between execute and findMissingBlobs calls. Bazel could generate one call to findMissingBlobs for a Bazel action, but make multiple execute calls for different subsets of inputs.

That said, some of the other policies sound intriguing.

werkt commented 4 years ago

I don't think it's safe to assume that there is an exact 1:1 correspondence between execute and findMissingBlobs calls. Bazel could generate one call to findMissingBlobs for a Bazel action, but make multiple execute calls for different subsets of inputs.

Precisely why I want a way to tell the client to go away directly. These are all hacks.

EricBurnett commented 4 years ago

From the initial message, PRECONDITION_FAILURE doesn't get retried in normal cases, right? If so, then standardizing a PRECONDITION_FAILURE 'details' field to indicate more explicitly that the RPC was rejected by the server SGTM.

I'm not sure I understand the connection to retries though? I don't think this would get retried at all by default?

On Mon, Mar 9, 2020 at 10:32 PM George Gensure notifications@github.com wrote:

I don't think it's safe to assume that there is an exact 1:1 correspondence between execute and findMissingBlobs calls. Bazel could generate one call to findMissingBlobs for a Bazel action, but make multiple execute calls for different subsets of inputs.

Precisely why I want a way to tell the client to go away directly. These are all hacks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bazelbuild/remote-apis/issues/108?email_source=notifications&email_token=AABREW7UTJPQN5FP7THKINTRGWYEPA5CNFSM4J3DLFTKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOJYHPY#issuecomment-596870079, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABREWZSIJ2JIGLFLAC4RH3RGWYEPANCNFSM4J3DLFTA .

bergsieker commented 4 years ago

I think that PRECONDITION_FAILURE is the wrong thing to return in this case--it implies that there's something about the system that can be changed to make the action acceptable (in this case, we interpret it to mean that there were missing inputs, and that uploading more files would fix the issue). I think that INVALID_INPUT is probably more accurate--the action simply is not suitable for remote execution.

That still leaves us with the problem of separating this type of invalid input from other types. I'm not sure that Java supports adding details to a status object.

EricBurnett commented 4 years ago

I'm fine with using INVALID_INPUT, sure. I don't think that changes anything materially (it's still a non-retriable error that the client can fall back or exit out as it sees fit).

sstriker commented 4 years ago

@bergsieker, @EricBurnett, did you mean INVALID_ARGUMENT? If so, that seems fair to me. @werkt, assuming you are still interested in having this, care to raise a documentation PR? The error is already listed as one that can be returned, as such I think it is only a clarification.