Improve Remote Execution failure diagnostics

TheGrizzlyDev commented 1 month ago

when RE is the OSS variant always log the action digest
on any kind of failure, both user failures as well as RE failures, propagate the message from the execution as it can contain useful informations for debugging (like links to some UI)

JakobDegen commented 4 weeks ago

So I had kind of had this thought on a previous PR, but I think now might be the right time to bring it up:

It might be good to simplify the RE configuration and setup situation by making buck2 aware of different RE providers. Specifically what I have in mind here is a concept of "profiles," ie a buck2_re_client.provider_profile that could be set to "engflow," "buildbarn," or left as some default variant. This profile would primarily affect the choice of defaults for many of the other configuration knobs in the buck2_re_client section, but could also be used for things like in this PR, as well as maybe simplifying other things.

Concretely, the benefits I'm imagining for this are:

On the buck2 core side, it makes maintenance and review simpler. It's very easy for me to accept a PR from someone at engflow updating defaults for the engflow profile, it's much harder if I have to spend time thinking about whether this will make the experience worse for users of other backends - that's especially true because I often just can't know whether that's going to be the case, and exhaustive manual testing against other RE providers is really cumbersome.
For maintainers of the RE backends, this should make it easier to ensure that their users have the right defaults. There's no need to publish a page with "if using with buck2, we recommend this list of settings," and there's no risk of user's defaults being stale wrt the latest recommendations.
It might reduce the total number of configuration knobs that are needed, which is often a win for mental overhead.

Is this something that you would be interested in either using or possibly also doing a bit of refactoring to implement it?

If so, I will need to raise this internally and make sure that the whole team is onboard with it, but I don't expect it to be too controversial.

As to this PR, my concern with it is basically that while I totally believe that this is useful and important for you and your users, I have basically no confidence that this won't contain anything from redundant to actively misleading information for other backends, so I don't know that I can just take it as-is. If you wanted to unblock yourself quickly we could just put this behind another configuration option, but that felt a bit silly hence the suggestion above.

Other than that, I think this ideally would include not only the little bit of extra text at the end, but the "action digest" blurb would be a part of this configuration too. So specifically, I'm imaging that somewhere we'd end up with a match statement that looks like

match profile {
    Profile::Meta => format!("To reproduce: `frecli cas download-action {}`", digest),
    Profile::Default => format!("Action digest: `{}`", digest),
    Profile::EngFlow => error.message.clone(), // Or whatever else you wanted
}

TheGrizzlyDev commented 4 weeks ago

As to this PR, my concern with it is basically that while I totally believe that this is useful and important for you and your users, I have basically no confidence that this won't contain anything from redundant to actively misleading information for other backends, so I don't know that I can just take it as-is. If you wanted to unblock yourself quickly we could just put this behind another configuration option, but that felt a bit silly hence the suggestion above.

I can help you figure this out since luckily I work on a RE backend (EngFlow) :) The whole point of the field is to display additional diagnostics from a RE backend on error or on demand (this part is currently missing as it requires some thinking in terms of UX) https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto#L1500 . This would work regardless of the RE implementation as per protocol it is how it is intended to be used (though, I admit, this isn't very clearly conveyed by the field's name).

As for the profiles, I fear that introduces logic that shouldn't live in the build system, but rather in the service. The protocol is already designed to support a wide level of configurability and I don't really see any specific use cases that would benefit from such a behaviour. Even this PR, it really is just a step towards following the more obscure aspects of the RE-APIs, those that a protobuf file hides away and that require reading the docs in depth (or having experience with a RE backend). Wdyt?

JakobDegen commented 4 weeks ago

I can help you figure this out since luckily I work on a RE backend (EngFlow) :)

Yes, I know, that's why I said what I did :)

The whole point of the field is to display additional diagnostics from a RE backend on error or on demand (this part is currently missing as it requires some thinking in terms of UX) https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto#L1500 . This would work regardless of the RE implementation as per protocol it is how it is intended to be used (though, I admit, this isn't very clearly conveyed by the field's name).

Indeed, but also, I don't find the documentation on the field to be much more clear either. Out of curiosity, do you know what bazel does?

Even this PR, it really is just a step towards following the more obscure aspects of the RE-APIs, those that a protobuf file hides away and that require reading the docs in depth (or having experience with a RE backend)

I had looked for some more information on this field but didn't find any, although probably I don't know where to look. Do you have a link?

In any case, if you don't want to go the configurable route, I think I am going to have to ask you to actually test this PR against other RE backends and share some snippets of what the resulting effect on the error message is. Without that I basically just don't know what this change does.

TheGrizzlyDev commented 4 weeks ago

I could test it on other RE backends but I don't really see the point. The goal of that field is to contain any additional debugging information a RE backend wishes to share with a client in case of a failure and as such it'd work exactly the same way with any other RE backend. This is also how Bazel uses it. If you think it could help I wouldn't mind adding a configuration flag to disable it though. Having said so, this really isn't anything EngFlow specific, we're just following the spec here :)

JakobDegen commented 4 weeks ago

@TheGrizzlyDev the bar for including this kind of thing in the output isn't that the information just needs to be vaguely relevant. It needs to actually provide sufficient additional value on top of the existing error message to justify taking up additional screen space and making the error message more verbose. Specifically for action errors I think we're particularly sensitive to this, because we get it wrong already - our action errors are too verbose and hard to read and really we should be looking to reduce the information present in them.

Now if you're going to use this to show a link to some debugging UI, that seems like it might meet this bar and be a good thing to include in the output. But if other RE backends are only going to say something like "command failed with exit code 1" or repeat the action digest or something like that, that's not something we should just dump onto the user.

TheGrizzlyDev commented 3 weeks ago

@TheGrizzlyDev the bar for including this kind of thing in the output isn't that the information just needs to be vaguely relevant. It needs to actually provide sufficient additional value on top of the existing error message to justify taking up additional screen space and making the error message more verbose. Specifically for action errors I think we're particularly sensitive to this, because we get it wrong already - our action errors are too verbose and hard to read and really we should be looking to reduce the information present in them.

Now if you're going to use this to show a link to some debugging UI, that seems like it might meet this bar and be a good thing to include in the output. But if other RE backends are only going to say something like "command failed with exit code 1" or repeat the action digest or something like that, that's not something we should just dump onto the user.

Well, when the field is present then by definition it adds relevant information for debugging. Though there aren't any restrictions for the contents, you shouldn't expect anything like the message you mentioned either. In our case we use to share a link that is very useful for debugging. As far as I can tell other backends either do not implement it or use it to propagate similarly useful information (buildbarn does the same thing https://github.com/buildbarn/bb-remote-execution/blob/b669c2edbdf3661ea1bf39126a2d2d01f81ab161/pkg/builder/caching_build_executor.go#L62 )

TheGrizzlyDev commented 2 weeks ago

Ping :)

TheGrizzlyDev commented 1 week ago

Ping @JakobDegen

facebook / buck2

Improve Remote Execution failure diagnostics #656