Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License
1.63k stars 836 forks source link

Question: What is the benefit of using the SDK built-in otel tracing provider, over instrumenting the transport with otelhttp package? #21778

Open serbrech opened 1 year ago

serbrech commented 1 year ago

The SDK has a TracingProvider and integration with opentelemetry in its beta version. it seems to include logic about marking spans as error spans based on http status code, and more.

This particular logic is a generic http transport level logic and well handled by the opensource middleware for http transport. What is the value-added over the existing otelhttp package of the TracingProvider integrated in the SDK?

Shouldn't the SDK leave the http level instrumentation to the otelhttp middleware, and augment the trace with SDK level only spans and attributes?

https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

jhendrixMSFT commented 1 year ago

The azcore/tracing package is meant to be an abstraction layer so that the SDKs aren't tied to any specific implementation. We then provide implementation-specific modules such as azotel that integrate with our abstraction layer. Or did I misunderstand the question?

serbrech commented 1 year ago

otelhttp middleware that I linked above deals with the span propagation and span status based on http status code, all at the transport layer. I can do that today, without the SDK even knowing about it.

opentelemetry sdk allows any library that integrates with it, to augment traces that are on the context already. I'm curious when I read code in the azure-sdk implementation that deals with the httpstatus code marking spans as error. it seems redundant, and a transport layer concern?

Why isn't the SDK just grabbing the span that's found, and adding some attributes, or maybe adding a span in the trace, and leaving the propagation/httpstatus handling to the http transport layer?

More specifically, this part of the integration:

https://github.com/Azure/azure-sdk-for-go/blob/17394dbbc2f79d10e7c7b04e5fa67b5bee8341f5/sdk/azcore/runtime/policy_http_trace.go#L67-L82

the otelhttp package does this by implementing a transport roundtrip:

https://github.com/open-telemetry/opentelemetry-go-contrib/blob/f567afc6e1900c0c958fa5f6ac2c2c18636bcdaa/instrumentation/net/http/otelhttp/transport.go#L125

the error is not an sdk error, but a http error.

jhendrixMSFT commented 1 year ago

Thanks for clarifying.

Our existing machinery conforms to the general guidelines for distributed tracing. If we were to use tracing at the transport layer, I believe it would prevent us from having a conformant implementation (e.g. suppression of spans for nested SDK client methods). Maybe there's a way to make this work but I didn't dig too deep. Perhaps a better way to say all this is that separating tracing from the transport allows us to independently enforce the semantics/behavior of each without mixing concerns.

CC @lmolkova

serbrech commented 1 year ago

I think you could create your span for the client method calls, but then let the transport middleware take care of the http part.

When client method creates a new span and internally calls into other public client methods of the same or different Azure SDK, spans created for inner client methods MUST be suppressed, their attributes and events ignored. Nested spans created for REST calls MUST be the children of the outer client call span. Suppression is generally done by Azure Core.

The transport can take care of the part I bolded above. it will automatically attach itself as a child of your SDK span. So the SDK could just care about the sdk attributes and leave all http concern to the transport middleware.

Once you hit the transport for an HTTP call, there is nothing to suppress anymore is there? any suppression would happen before that.

One reason that I'm interesting in this, is that we instrument all our http outgoing calls. SDK or not. If the ones instrumented using the community transport middleware gives us a completely different set of attributes, then we have to handle them differently in monitoring, alerting, dashboards, etc... standardizing the span attributes is an interesting feature IMO.

serbrech commented 1 year ago

Re-read myself, and just want to clarify that I have nothing against your current direction. my de-facto thoughts about open telemetry integration was about how how to integrate and leverage the community accepted http transport middleware, and augment it with azure-sdk specific attributes or spans.

That's the reason of this issue/question :).

Most of our internal needs are covered by instrumenting the transport. Our own service level traces already add attributes like correlation id, operation id etc...

For most ARM services, I think that the most important otel integration part is to augment the traces with some azure specific attributes. maybe SDK specific spans/events for the different pipeline middleware. The SDK implementation would ensure that they are standardized for Azure wide correlation and tracing. That would be a huge leap forward. (Same attributes across all SDKs used by all internal teams). Although one could argue that this can be achieved by ARM, since all calls transit via ARM and the trace context is then flowing with the set of attributes... :)

Where I'm more curious is how this will integrated with special or non-http SDKs (messaging, dataplanes, etc...)

jhendrixMSFT commented 1 year ago

The set of attributes we use are detailed here and conform to OTel conventions so there shouldn't be any differences. That said, I took a look, and we collect a subset of attributes compared to the roundtripper you linked (e.g. HTTP protocol version and request content length; there might be others). I'll defer to @lmolkova about why.

While we could integrate HTTP tracing via the transport, it would be a noticeable design change from our general design guidelines as we've viewed the transport and trace provider as separate entities. Not to say it couldn't be done, but I think at this point we'd need a really compelling reason to deviate.

Having our own HTTP trace policy also gives us complete control of the information included in traces. E.g. this can include the complete URL including unredacted query params which is against our design guidelines.

serbrech commented 1 year ago

Having our own HTTP trace policy also gives us complete control of the information included in traces. E.g. this can include the complete URL including unredacted query params which is against our design guidelines.

I would see this as an opportunity to contribute upstream and allow customizing what is included in the trace ;) but it's design details at this point, and I see how keeping control all the way down matters for you.

I suppose that having the transport middleware would still work. It would grab the current trace, add its own span/attributes, and override the traceContext that is on the request just before it goes out. so I think it would still be compatible in case both are set.

lmolkova commented 11 months ago

Just noticed this issue.

In general, we'd rather use otel instrumentation, but:

given everything needs to be explicitly configured in golang, having our own tracing policy simplifies configuration without taking anything away from users.

Let me know if I'm missing something.