dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.24k stars 4.73k forks source link

Support for tracestate collection in Activity of DiagnosticSource #26463

Open SergeyKanzhelev opened 6 years ago

SergeyKanzhelev commented 6 years ago

Motivation

W3C spec for distributed tracing context defines a new concept of tracestate. In order to implement W3C specification this concept should be implemented in Activity class.

Tracestate collection should allow to make multiple types of modifications:

CC: @seth-capistron, @vancem, @jacpull, @lmolkova, @glennc, @davidfowl

Proposed API

namespace System.Diagnostics
{
    public class Activity
    {
        ...
        // <summary>
        //  List of key/value pairs representing the tracestate field of distributed tracing context.
        //  https://github.com/w3c/distributed-tracing/blob/master/trace_context/HTTP_HEADER_FORMAT.md#tracestate-field
        //  TraceState is a list of named opaque values. Both key and value has a set of limitation imposed by W3C specification.
        //      key:  
        //          Identifiers are short (up to 256 characters) textual identifiers. Note that identifiers 
        //          MUST begin with a lowercase letter, and can only contain lowercase letters a-z, digits 0-9, 
        //          underscores _, dashes -, asterisks *, and forward slashes /.
        //      value: 
        //          Opaque string up to 256 characters printable ASCII RFC0020 characters (i.e., the range 0x20 to 0x7E) 
        //          except comma , and =. Note that this also excludes tabs, newlines, carriage returns, etc.
        //  There is also a limit on overall length of the TraceState collection. For complete specification refer to W3C document.
        //
        //  Setting of key/value pair with incorrect key or value will result in ArgumentException.
        // </summary>
        public SortedDictionary<KeyValuePair<string, string>> TraceState { get; }
        ...
    }
}

Notes

Usage

Read tracestate key's value

This example shows how the value of sampling-score which is double can be read from TraceState.

var samplingScoreStr = Activity.Current.TraceState["sampling-score"];
var samplingScore = string.IsNullOrEmpty(samplingScoreStr) ? 1 : Convert.ToDouble(samplingScoreStr);
if (samplingScore > sampling)
{
    // Do not track telemetry.
}

Change of tracestate should be reflected in every child activity of this activity. Supported mutations of trace state:

Update key value

Activity.Currnet.TraceState["sampling-score"] = (0.23).ToString();

Add new key-value pair

Any other values with the same name should be removed from the list.

Activity.Currnet.TraceState["sampling-score"] = (0.23).ToString();

Delete the key-value pair

Activity.Currnet.TraceState.Remove("sampling-score");
SergeyKanzhelev commented 6 years ago

Some additional thoughts on proposed API after discussion with @adriancole:

I will update Usage section to indicate common vs. uncommon scenarios. Please comment if you have feedback

SergeyKanzhelev commented 6 years ago

Comment from @lmolkova before - APIs of Activity do not throw generally. Altermative is to swallow exception or clean up data (e.g. lowercase the key). @karelz any guidance here?

karelz commented 6 years ago

I let area experts - @vancem @brianrob to chime in. IMO we should be resilient to invalid data. Ignoring them or cleaning them up sounds reasonable.

vancem commented 6 years ago

I would like to separate the 'HTTP-like stuff' from the Activity class. Its job is to be a container that propagates things in a way useful for causality tracining. Stuff associated with HTTP conventions for encoding and parsing stuff does not really belong here. Also the W3C rules for propagating this value don't belong in activity.

My knee-jerk reaction is that 'tracestate' should be just a piece of baggage (That is the whole string that encodes all the pieces just gets put into the 'traceState'). To do this well I think it is useful to have a UTF8String version of the baggage APIs (so you don't have to convert your UTF8 to strings which is nice), and I am also open to a 'Remove' API for baggage if we need it.

Effectively 'traceState' just becomes one more piece of baggage, which you already have to enumerate tu support 'correlation-context'. Thus you just need to watch for 'traceState' as you process that and encode/process it specially when you see it.

I do think having UTF8String (bascially Span) versions of the Baggage APIs makes sense so as to avoid lots of allocation / parsing.

SergeyKanzhelev commented 6 years ago

@vancem tracestate is not a collection of tags associated with distributed trace. It is a current position in a distributed graph of a specific vendor. It may be updated on every request to indicate the "last seen span-id" that belong to the current vendor. There is an explanation in section 2.1 of https://w3c.github.io/distributed-tracing/report-trace-context.html

Also the concept is not http specific. Unless your comment was about lazy parsing. Than I agree that base Activity class may not be the best place for it.

That's said the idea to put it in Correlation Context will require some extra parsing and memory allocations. Especially if one will need to update the value in tracestate for every outgoing call. Which is the case for any hierarchical state implementation.

Does it make sense with this semantics of tracestate explanation to create a dedicated collection? Or you feel strongly about placing it into Correlation Context collection?

vancem commented 6 years ago

@SergeyKanzhelev. Yes I know that tracestate is a list of IDs that represent transition points. But it is also just a string that needs to be propagated through user-code so that when other operations (e.g. HTTP sends) happen, knowledgeable code can update the state and send it out.

In that sense FROM ACTIVITY's point of view, tracestate is just a piece of baggage (something it remembers so that other things can use it. The only thing that Activity has to do is remember it (and propagate it to all its children).

Yes, something needs to update tracestate. I am just saying that we have the choice of what that is, and arguably all that logic ideally would be together. Thus I am arguing that logic that hooks request can also do the update and update the 'tracestate' key of the baggage to be the new string that will be written to output requests. Activity's job is simply to remember it.

Note that Activity's baggage is actually NOT a dictionary, and in fact, my suggestion is we make it so that you pass in Span for the key and values as UTF8 blobs, and Activity pretty much simply saves a big blob of concatenated blobs.

Also note that hopefully tracestate is in fact empty most of the time and tends not to be modified (this is the single vendor case). It seems like other pieces of the correlation context might be at least as likely to be updates as this piece (is there is not a strong perf reason to treat it specially, we need to make fiddling with the correlation context reasonably efficient)

I am not trying to be difficult, but I do think the devil is in the details, and we are talking about very hot code paths (anything that is per-request or faster needs careful consideration). This means we want to be super lazy about parsing (especially if common scenario may be pass through, or a simple morph that could be done without exploding the data into some data structure (like a SortedDictionary).

I am OK with having a string or UTF8String blob tracestate property on Activity (in general I am OK with 'well known' pieces of baggage, since it encourages type safety, discoverability. Do you want to do that?

I think the more interesting/probelmatic issue is what we do about IDs.

SergeyKanzhelev commented 6 years ago

@vancem I'll need to think about it and prototype a little to understand the most common patterns. Two concerns I have immediately are:

  1. With unparsed tracestate is that it moves knowledge of format from technology that recieved that field in one form or another (http vs. grpc vs. amqp...) to consumer of that information that ideally should be protocol-neutral. We need to understand whether tracestate is compresseable at all or will have the same format in every protocol.
  2. We discussed that there potentially will be no truly generic tracers. And every tracer may just copy current traceparent to tracestate under it's name. So tracestate is never empty. However talking to people implementing platforms (for instance azure services) I see a strong desire to avoid doing it. Mostly because they don't know which vendor will consume traces in the end of the day. Actual behavior here will contribute into what's typical pattern and what's not.

I like the idea to keep tracestate in correlation context for now as a workaround. @lmolkova it may be a good idea to work you are doing in Azure Functions. At least for the short term.

I am not trying to be difficult

I'm not trying to be pushy and see where you are coming from =)

codefromthecrypt commented 6 years ago
  1. We discussed that there potentially will be no truly generic tracers. And every tracer may just copy current traceparent to tracestate under it's name. So tracestate is never empty. However talking to people implementing platforms (for instance azure services) I see a strong desire to avoid doing it. Mostly because they don't know which vendor will consume traces in the end of the day. Actual behavior here will contribute into what's typical pattern and what's not.

Agree about currently no generic ones, but there could be. More importantly the lifecycle in-process is different than out of process. If you originate a trace in-process it could be that you don't want to serialize your state eagerly (in case you never push anything downstream). So you could have a special case in a library where you see no state, but you do see parent (for correlation purposes). This depends on if the api is exposing the serialized state. Agree that if it is deserialized authority of in-process state, then you always have state here.

cwe1ss commented 6 years ago

I'm with @vancem on this one. We have to be super careful with what gets added to Activity. Activity's main purpose is code "instrumentation" and that code shouldn't have to know about tracer-specific stuff like propagation headers etc.

If baggage is not sufficient for this, we must make sure that this has a general reason of existence - independent of any specification. The W3C spec is not yet finalized and even if it will be, it's "just one spec". It might win, it might not... There might be a new spec (or a new version of this spec) in 5 years.

There already are the proprietary Request-Id and Correlation-Context headers which have been hard-coded into the system. This mistake should not happen again. If the .NET framework decides to implement a specification, it should give people a choice to choose whichever specification they want.

codefromthecrypt commented 6 years ago

Baggage has been a liability in practice due to its lack of definition (like ttl), bloat, likelihood of leaking secrets, or abstraction breaking (ex encouraging business code to use tracing apis to do business functionality). It is problems with OT baggage (including entry control, privacy etc) which led to some of the delays in correlation context, ex how to make sure these problems aren't re-introduced under a new name. Many have mentioned baggage was a bit half-baked, if we are talking about OT baggage as opposed to the original.

TLDR I would steer very clear from having request scoping behavior being in any way tainted with OT behavior which in practice is propagate anything forever downstream via headers.

cwe1ss commented 6 years ago

.NET's Activity type is roughly based on OT's Span but it definitely isn't an OT implementation.

I had another look at the W3C spec document. Seems like what we currently have with Baggage more closely relates to W3C's Correlation-Context, right?

But I'm now confused by @SergeyKanzhelev example in this issue where he used sampling-rate as a key. Shouldn't that also be a Correlation-Context value? The W3C spec seems to only talk about vendor-specific span ids in its definition of the tracestate header.

mconnew commented 6 years ago

I know I'm coming late to this game, but I have some questions, some specifically about how this would relate to WCF. The W3C spec only talks about how this pertains to HTTP, but in the discussion there's talk about grpc and amqp. WCF also has it's own protocols so anything that is done has to be transport agnostic.
The W3C spec also quickly breaks down and fails hard when WCF is used. The tracestate can be up to 512 bytes long, and each state item can be up to 513 bytes (256 bytes for key, = character, 256 bytes for the value). Ignoring that off by 1 bug in the spec, you are only guaranteed to be able to have a single item in the tracestate. The whole purpose of the spec is to allow correlation across different tracing technologies. WCF has it's own end to end correlation tracing which if turned on means you have two different logging systems on at the same time (DiagnosticsSource based and WCF'd own tracing). This means you must propagate two tracing states or you will break one of the tracing correlation mechanisms. You aren't guaranteed to be able to do that.
But if this is the correlation standard is the one we are going to hitch our horse to, then there should be an API to manage the data. Otherwise WCF will need to write the parsing code and truncation logic to add our state to the data. But it shouldn't be part of the Activity class, it should be a separate implementation which can handle mutating the state correctly and exposing data strongly typed and have the ability to easily and cheaply import/export the values in a way that it can be added to the baggage as an opaque value. ApplicationInsights can have extra handling to pull the value out of the baggage and put it as it's own http header if needed. Frameworks such as WCF can add it as a SOAP header which carries the baggage transparently so the data flows and if this is the industry winner, add smarter handling later to add our own state.