cockroachdb / errors

Go error library with error portability over the network
Apache License 2.0
2.1k stars 67 forks source link

extgrpc: add support for automatic en/decoding of gRPC Status #65

Closed erikgrinaker closed 3 years ago

erikgrinaker commented 3 years ago

This patch adds support for automatic en/decoding of gRPC Status errors, enabled by importing the extgrpc package. It supports statuses generated both by the standard google.golang.org/grpc/status and the gogoproto-based github.com/gogo/status packages.

Encoded statuses are always represented as gogo/status-based Status Protobufs, since the EncodedError type is a gogoproto type using an Any field for structured errors, and only gogoproto types can be placed here since standard Protobufs are not registered in the gogoproto type registry. We lock the gogo/googleapis package to 1.2.0 to use gogoproto 1.2-compatible Protobufs required by CRDB, these have been verified to not be vulnerable to the "skippy pb" bug.

Note that to preserve error details, gogo-based Status objects with gogo-based details must be used, otherwise details may not be preserved. This is for similar reasons as outlined above, and is a limitation in the gRPC package -- see code comments for further details. A CRDB linter rule forbidding use of Status.WithDetails() was submitted in cockroachdb/cockroach#61617.

Resolves #63, related to #64 and cockroachdb/cockroach#56208.


This change is Reviewable

tooolbox commented 3 years ago

Thanks @knz It looks fine to me in terms of implementation and makes sense to maximize compatibility.

I guess I have only a comment on best practices or standards of usage, which may seem strange coming from an outsider, but since I have been using the extgrpc/middleware packages I wrestled with this and I'll explain what I mean:

If you have a service using cockroachdb/errors, and in particular the gRPC bits, I have seen a limited amount of scenarios where you want to return an actual gRPC Status. Since it's a sort of leaf, you can't truly wrap another error, there's no stack trace information, etc. That's fine if it's an OK, but if you are returning a non-OK code you are likely dealing with an error somewhere deeper in your service, either your own code or a library, and I personally would want to wrap all that information and send it to the caller.

That's why I contributed cockroachdb/errors/grpc/status, an offhand stab at a Status API that can preserve a stack of errors. I'm sure it lacks in various ways, but from a DX perspective it's truly awesome to have a failed call give you a stack trace that's 3 services deep. If you're returning raw gRPC Status objects, or their GoGo equivalents, you might be missing out on the last mile of that trace?

Again, that's just commentary. Code is LGTM.

erikgrinaker commented 3 years ago

I think you have a good point @tooolbox, but I don't think these approaches are mutually exclusive. I checked the CRDB codebase, and could only find two places where a gRPC Status actually wrapped another error -- all others were leaf errors triggered by some condition (e.g. a failed permission check). Where we're wrapping an error it'd probably be better to use WrapWithGrpcCode to preserve details -- and in that case we may also want to make withGrpcCode implement the gRPC status interface (i.e. add GRPCStatus() *Status) so that they can be used interchangably with the gRPC status package.

Regardless, I still think there's value in preserving a Status error as it passes through EncodeError/DecodeError. This currently results in an opaqueLeaf instead, which has no structure at all. This is a problem in CRDB, since we actually use EncodeError/DecodeError internally in the codebase while propagating errors up the stack (which arguably may not be a good idea in itself). This means that when a gRPC client call receives a codes.PermissionDenied and returns that up the stack, that information is lost and the caller is unable to determine that it was in fact a permission error and e.g. stop retrying the operation.

While we probably should improve the error handling in CRDB, that's a larger yak to shave, and I don't see any harm in preserving Status errors across encode/decode to solve the immediate problem at hand.

knz commented 3 years ago

This currently results in an opaqueLeaf instead, which has no structure at all. This means that when a gRPC client call receives a codes.PermissionDenied and returns that up the stack, that information is lost

Can you spell this out further? opaqueLeaf is meant to actually preserve details about the original error, so that it can still be decoded elsewhere. This makes it possible to start implementing a decode function without any corresponding encoder (e.g. it's possible to start decoding an error that's been encoded as opaque by a node running a previous version of the code, prior to the implementation of the decode function)

erikgrinaker commented 3 years ago

Can you spell this out further?

When we send a batch command to the target replica via RPC, we end up wrapping the resulting error here (roachpb.NewError calls EncodeError internally):

https://github.com/cockroachdb/cockroach/blob/d5f73c6a05c99f7c6b40b80dea80e8ed8ab15cac/pkg/kv/kvclient/kvcoord/dist_sender.go#L1533

Further up the stack the Protobuf roachpb.Error is converted back into a Go error (GoError calls DecodeError internally):

https://github.com/cockroachdb/cockroach/blob/daf42d6b8c87cbd3c0cf831b0cd8f7b97fafd3f5/pkg/kv/db.go#L724

A caller higher up the stack then tries to check whether the error is a gRPC authentication error by checking its status code:

https://github.com/cockroachdb/cockroach/blob/d5f73c6a05c99f7c6b40b80dea80e8ed8ab15cac/pkg/util/grpcutil/grpc_util.go#L96-L104

However, at this point error is opaqueLeaf, so status.FromError() is unable to convert it into a gRPC status, and IsAuthError returns false. I don't believe opaqueLeaf can preserve the Status Protobuf automatically, because it's a standard Go Protobuf which isn't recognized by the Gogo decoder, and thus it can't populate opaqueLeaf.details.FullDetails which is a Protobuf Any field.

knz commented 3 years ago

I don't believe opaqueLeaf can preserve the Status Protobuf automatically, because it's a standard Go Protobuf which isn't recognized by the Gogo decoder

ah! that's the bit I was missing. That is indeed unfortunate.

tooolbox commented 3 years ago

Gotcha @erikgrinaker, yeah if your service(s) are shallow enough that leaf error generation sites know they're in a gRPC service, then great.

we may also want to make withGrpcCode implement the gRPC status interface

Perhaps, yeah. I would like to make sure that the grpc/middleware package still works transparently, as intended.

I don't see any harm in preserving Status errors across encode/decode to solve the immediate problem at hand.

For sure, agreed.