GoogleCloudPlatform / opentelemetry-operations-go

Apache License 2.0

Spans aren't being marked as errors in Cloud Trace #730

Open andypwarren opened 1 year ago

andypwarren commented 1 year ago

Hi,

I'm instrumenting a gRPC server with OpenTelemetry and Google Cloud Trace. I can see spans in my Trace dashboard, but they aren't coloured red when an RPC returns an internal error. I'm using otelgrpc.UnaryServerInterceptor() (code here), which calls span.SetStatus with the OTel error code and the gRPC message if any of these statuses are returned.

I've also tried calling span.SetStatus outside the interceptors, and Cloud Trace doesn't colour those spans red either, so I don't think the problem is with the interceptor code.

I've created a simple demo app to reproduce this using the example grpc-go Greeter service with the addition of tracing using otel and cloudtrace.

When I force a request to fail, this is what I see in Cloud Trace:

[Screenshot: Screen Shot 2023-10-04 at 16 21 05]

The interceptor has added the attribute rpc.grpc.status_code: 13, but the span status isn't showing up.

Ideally this would produce a red dot in the trace graph and the span would be coloured red.
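Because the UI gives no visual hint, the numeric rpc.grpc.status_code attribute has to be read by hand. A minimal sketch of a lookup that turns the number into its standard gRPC status name; the helper name grpcCodeName is hypothetical and not part of otelgrpc or the exporter:

```go
package main

import "fmt"

// grpcCodeNames lists the standard gRPC status codes, indexed by numeric value.
var grpcCodeNames = []string{
	"OK", "CANCELLED", "UNKNOWN", "INVALID_ARGUMENT", "DEADLINE_EXCEEDED",
	"NOT_FOUND", "ALREADY_EXISTS", "PERMISSION_DENIED", "RESOURCE_EXHAUSTED",
	"FAILED_PRECONDITION", "ABORTED", "OUT_OF_RANGE", "UNIMPLEMENTED",
	"INTERNAL", "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED",
}

// grpcCodeName (hypothetical helper) returns the name for a numeric
// rpc.grpc.status_code value, or a placeholder for out-of-range values.
func grpcCodeName(code int) string {
	if code < 0 || code >= len(grpcCodeNames) {
		return fmt.Sprintf("CODE(%d)", code)
	}
	return grpcCodeNames[code]
}

func main() {
	fmt.Println(grpcCodeName(13)) // INTERNAL
	fmt.Println(grpcCodeName(7))  // PERMISSION_DENIED
}
```

So the 13 on the span above corresponds to INTERNAL, which matches the internal error the handler returned.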

Many thanks,

Andy

dashpole commented 1 year ago

This seems suspicious... https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/6b51e22bb5a89706db8e631469b1a491a0a693c3/exporter/trace/trace_proto.go#L162-L163

dashpole commented 1 year ago

Seems potentially related to https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/issues/143

dashpole commented 1 year ago

@aabmass do you remember why we set codes.Error to codepb.Code_UNKNOWN?

aabmass commented 1 year ago

Unknown represents an unknown error, along the lines of an HTTP 500 status code. Since OTel only has two possible statuses (OK and ERROR), gRPC's UNKNOWN (error) seems reasonable.

Do you know what status codes actually show red in Cloud Trace?
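For readers following along, the mapping under discussion can be sketched roughly as below. This is not the exporter's code: the types are local stand-ins for go.opentelemetry.io/otel/codes and google.rpc.Code, and only the Error to UNKNOWN case comes from this thread. Mapping OK to OK and leaving Unset without a status are assumptions made for illustration.

```go
package main

import "fmt"

// otelStatus is a local stand-in for the three OpenTelemetry span status codes.
type otelStatus int

const (
	statusUnset otelStatus = iota
	statusError
	statusOK
)

// Numeric values of the two google.rpc.Code entries used here.
const (
	rpcOK      int32 = 0
	rpcUnknown int32 = 2
)

// toRPCCode returns the google.rpc code for a span status and whether a
// status should be attached at all.
func toRPCCode(s otelStatus) (code int32, set bool) {
	switch s {
	case statusOK:
		return rpcOK, true // assumption: OK maps to OK
	case statusError:
		return rpcUnknown, true // from the thread: Error maps to UNKNOWN
	default:
		return 0, false // assumption: Unset attaches no status
	}
}

func main() {
	code, set := toRPCCode(statusError)
	fmt.Println(code, set) // 2 true
}
```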

dashpole commented 1 year ago

It does seem like we are doing the right thing based on https://pkg.go.dev/google.golang.org/genproto/googleapis/rpc/code#Code

// Unknown error. For example, this error may be returned when
// a Status value received from another address space belongs to
// an error space that is not known in this address space. Also
// errors raised by APIs that do not return enough error information
// may be converted to this error.
//
// HTTP Mapping: 500 Internal Server Error
Code_UNKNOWN Code = 2

> Do you know what status codes actually show red in Cloud Trace?

I'll see if I can find the answer to that question.

dashpole commented 1 year ago

I tested all status codes, and none of them appear to make the span look like an error.

dashpole commented 1 year ago

I'll reach out to the trace UI team.

BradleyChatha commented 1 year ago

For further context, the way we're currently getting around this is by setting the attribute /http/status_code to 500, regardless of whether the context is for an HTTP server or not.

It seems to be the only way to make the trace UI render it as an error.

andypwarren commented 1 year ago

Hi @dashpole, is there any update on this?

dashpole commented 1 year ago

The Cloud Trace folks are aware of the issue, and suggested the same workaround pointed out above: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/issues/730#issuecomment-1755622579. I'm not sure about timelines, but I'll post here when there are updates.

shraddhaag commented 10 months ago

+1. We are facing this problem as well. Thanks for pointing to the workaround!

aabmass commented 3 months ago

Lowering to p3 since the workaround is sufficient

nikolaydubina commented 1 week ago

The workaround is odd. Adding an HTTP status code to the code just so Google Trace can recognise the error is not good. For example, for a custom span (e.g. span kind worker or consumer in OpenTelemetry lingo), Google Trace is not helpful. At least being able to configure what the user treats as an error would help. For example, this is possible in Grafana.

  1. Using a custom solution just so Google Trace can work is extra effort for developers and not good for cross-vendor compatibility.
  2. OpenTelemetry already defines very simple and minimalistic Ok and Error span statuses that work for any language, out of the box, for all systems that use OpenTelemetry. Why not use them?

I spent some time finding where the error happened by looking at the span attributes and finding gRPC status 7. No colours at all, all blue. Very hard to read. Google Trace needs to improve.

Finding error traces is a very important user feature. I hope you guys at Google realise that if this does not work, this is a 👎🏻 for the Google Trace Explorer product.