census-instrumentation / opencensus-specs

Apache License 2.0
188 stars 50 forks source link

Proposal: record first local span's ID in SpanContext #229

Open axw opened 5 years ago

axw commented 5 years ago

I maintain the Elastic APM Go Agent. I've spent some time investigating whether we can implement an OpenCensus exporter, and I've hit a blocker that I think requires a change to opencensus-go.

The issue is that we have an additional concept of "transaction", which you can think of as the entry span in a service (e.g. an incoming HTTP request). Every span within a transaction records the trace ID, transaction ID, and (if it has one) parent span ID. Every span has a transaction ID, which may used for grouping. For each service in a trace, we would map the first local OpenCensus span (i.e. spans with a remote parent) to an Elastic APM transaction.

Because the IDs are all flowing to the exporter from opencensus-go, which has no concept of transaction, we cannot propagate the transaction ID to child spans in process.

Describe the solution you'd like

Record the first local span's ID in SpanContext / record the IDs of spans which have remote parents.

Describe alternatives you've considered

Alternative 1

I have considered keeping a map of OpenCensus span IDs to Elastic APM transaction IDs. This would be expensive to maintain, and would only work if the parent ends (i.e. is exported) before the child, which would be unusual.

Alternative 2

Another option would be something long the lines of the callbacks described in https://github.com/census-instrumentation/opencensus-specs/issues/70. Our exporter/plugin would be invoked when a span is created, along with the parent span context; we would store the transaction ID in tracestate, and propagate via this callback. There are a couple of issues with this approach:

Alternative 3

Change Elastic APM to make the transaction ID on spans optional. As of its current release (6.6), they are required. While we may lose some future functionality for trace data originating in the OpenCensus, but for now the impact would be minimal.

axw commented 5 years ago

Brave very recently introduced the concept of "local root", which is conceptually the same thing: https://github.com/openzipkin/brave/pull/801

codefromthecrypt commented 5 years ago

yes we also used this in testing aggregation techniques with expedia haystack/adaptive alerting last week. in cases where you want to reduce volume to just remote spans this can be handy

On Mon, Jan 21, 2019, 5:46 PM Andrew Wilkins <notifications@github.com wrote:

Brave very recently introduced the concept of "local root", which is conceptually the same thing: openzipkin/brave#801 https://github.com/openzipkin/brave/pull/801

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/census-instrumentation/opencensus-specs/issues/229#issuecomment-456010447, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD6156HEPBBeZsC29pOP-Pv1i-_7Epoks5vFYyFgaJpZM4aKA9h .

SergeyKanzhelev commented 5 years ago

perf implications aside, I wonder if this will be the job for tracestate? The whole idea of tracestate is to hold a position of a span in a multiple distributed traces graphs. Holding Elasic's transaction ID it in tracestate will also give you parent/child relationship between transactions as tracestate will be propagated across processes.

axw commented 5 years ago

perf implications aside, I wonder if this will be the job for tracestate? The whole idea of tracestate is to hold a position of a span in a multiple distributed traces graphs. Holding Elasic's transaction ID it in tracestate will also give you parent/child relationship between transactions as tracestate will be propagated across processes.

FYI, we are implementing the W3C Trace-Context draft, but we currently only use traceparent and not tracestate.

It could hypothetically be useful to propagate transaction IDs between processes, but we do not currently have a need for that. In our model, a transaction's parent may be either another (remote) transaction, or a (remote) span. We set the "span" ID in traceparent to either the remote transaction or span ID. If I understand @adriancole's comment correctly, we're doing something very similar: in some cases we elide a transaction's details (i.e. its local spans), and connect the transactions directly.

Regardless, it would require vendor-specific code in the application, right? And so the exporter service wouldn't be possible?

codefromthecrypt commented 5 years ago

FWIW the reason we said localRoot is to imply this isnt likely useful in remote (propagated) contexts.

so it is a library concern but not likely in scope for headers

On Tue, Jan 22, 2019, 1:45 AM Sergey Kanzhelev <notifications@github.com wrote:

perf implications aside, I wonder if this will be the job for tracestate? The whole idea of tracestate is to hold a position of a span in a multiple distributed traces graphs. Holding Elasic's transaction ID it in tracestate will also give you parent/child relationship between transactions as tracestate will be propagated across processes.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/census-instrumentation/opencensus-specs/issues/229#issuecomment-456152881, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD614bmcu20n6sjQH07kVYarK3arT3sks5vFfyjgaJpZM4aKA9h .

SergeyKanzhelev commented 5 years ago

Got it. So it's not an issue of how to implement it - there are at least two ways OC SDK should be able to support this scenario - via local-only scoped tags or via tracestate. It is about a new concept to be introduced natively.

Is transaction ID the only thing needed? In Application Insights we used to have a notion of local operation name (basically localRoot.Name). This name is also useful (maybe even more) for aggregations. User ID and Session ID obtained by the incoming request handler are also good candidates to store in this localRoot context natively. Do you think those are in scope of this issue?

codefromthecrypt commented 5 years ago

so your question is whether to add more scope than the id? specifically names are problematic as they change over time where the id does not. That said, I wonder if you couldn't create a mapping to any data given the local root.

On Wed, Jan 23, 2019, 1:01 AM Sergey Kanzhelev <notifications@github.com wrote:

Got it. So it's not an issue of how to implement it - there are at least two ways OC SDK should be able to support this scenario - via local-only scoped tags or via tracestate. It is about a new concept to be introduced natively.

Is transaction ID the only thing needed? In Application Insights we used to have a notion of local operation name (basically localRoot.Name). This name is also useful (maybe even more) for aggregations. User ID and Session ID obtained by the incoming request handler are also good candidates to store in this localRoot context natively. Do you think those are in scope of this issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/census-instrumentation/opencensus-specs/issues/229#issuecomment-456477682, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61zEvWRK9iNK2YPkHFztv4QeYjybPks5vF0PMgaJpZM4aKA9h .

axw commented 5 years ago

Is transaction ID the only thing needed?

For us: currently, yes.

In Application Insights we used to have a notion of local operation name (basically localRoot.Name). This name is also useful (maybe even more) for aggregations. User ID and Session ID obtained by the incoming request handler are also good candidates to store in this localRoot context natively. Do you think those are in scope of this issue?

I think User ID and Session ID might fit better in Correlation-Context, since they may be useful for aggregations across services. IIANM, this is already handled with instrumentation-provided tags.

For Elastic APM's exception reporting, we could make use of a locally propagated transaction name/type for aggregation/filtering. That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

SergeyKanzhelev commented 5 years ago

so your question is whether to add more scope than the id?

@adriancole I'm wondering if local root ID is just one of properties alongside other things that may be promoted to local/remote tags or to the tracestate. There may be elaborate configuration for this kind of attributes to tags promotion.

From API cleaness - implementing local root ID natively may require new methods like StartScopedLocalRoot or ResetLocalRootId on span builder. When implementing it via tags or tracestate extension module will keep APIs simpler.

That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

Can you please elaborate on why local root ID fits and name/type doesn't from your perspective?

axw commented 5 years ago

That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

Can you please elaborate on why local root ID fits and name/type doesn't from your perspective?

I was referring to exception reporting here, not local root IDs. Elastic APM provides a means of reporting exceptions that occur within a transaction or span; OpenCensus does not (aside from setting the status of spans). If there were some mapping, then there might be additional local root/transaction properties that we would want to propagate locally.

From API cleaness - implementing local root ID natively may require new methods like StartScopedLocalRoot or ResetLocalRootId on span builder. When implementing it via tags or tracestate extension module will keep APIs simpler.

I'm relatively inexperienced with OpenCensus, so can you please expand on that a bit? Specifically how is that cleaner than my proposal? In case my proposal was not clear, here it is again with references to the Tracer and SpanBuilder from the Java API:

For Spans created via a SpanBuilder obtained with Tracer.spanBuilderWithRemoteParent, the span's ID will be recorded in its SpanContext as "local root ID". This will be inherited by descendant spans created within the same process.

codefromthecrypt commented 5 years ago

hi Sergey. having implemented this feature I can assure there is zero api surface area. the first entry is implicit and can be done internal to the tracer. in brave there was no new Api method needed to track local root as it is defined.

like I said.. a span id is immutable. when you ask how things are different between this and mutable data such as span name please ponder this. if it isnt clear practice on your own and you will notice the difference.

On Thu, Jan 24, 2019, 4:03 AM Sergey Kanzhelev <notifications@github.com wrote:

so your question is whether to add more scope than the id?

@adriancole https://github.com/adriancole I'm wondering if local root ID is just one of properties alongside other things that may be promoted to local/remote tags or to the tracestate. There may be elaborate configuration for this kind of attributes to tags promotion.

From API cleaness - implementing local root ID natively may require new methods like StartScopedLocalRoot or ResetLocalRootId on span builder. When implementing it via tags or tracestate extension module will keep APIs simpler.

That doesn't fit with OpenCensus anyway though, so that's very hypothetical.

Can you please elaborate on why local root ID fits and name/type doesn't from your perspective?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/census-instrumentation/opencensus-specs/issues/229#issuecomment-456946092, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD619SHvuubWjKWDEavQ7eK41q6Vncnks5vGL_9gaJpZM4aKA9h .

SergeyKanzhelev commented 5 years ago

Yeah, I meant "fullness", not "cleaness". Sorry. I looked at brave PR @adriancole posted above. My comment was that with any new concept there will be people who want to have better control over when local root being set and reset. One example may be if incoming request is tracked by automatic module, but one want to signal that the "real" local root is the span that is tracked from customer code and which has all the information representing local root like span name and attributes. Another scenario is a trace that initiate another trace as a background job. Given those may be just theoretical - @adriancole I wonder if anybody asked for this after this concept was introduced?

Another two questions for the new concept I was trying to raise with my questions:

  1. Is it universal and needed for every (or most) open census user?
  2. Is it generic enough and whether the very next request will be to expose other local root span properties?

First is enough to have it. Second is nice to think about in advance.

I'm all for the scenarios enabled via local root. And I understand desire to make this feature to be pre-installed. My first reaction on this is that configurable extension approach would be a better option here. What are your thoughts on whether it's universal and generic.

codefromthecrypt commented 5 years ago

I think the target is exporters not end users. I don't expect end users to employ this data. it could seem fine to wait for more people to ask here specifically. line brave we already waited until multiple use cases presented themselves before adding the feature. it is no harm waiting for people to make this popular if you have some gauge for what would make it popular enough.

I would offer that the local root functionality is technically difficult if possible to do as a bolt on. If there is an efficient way that doesnt add api surface area personally I wasnt able to find it. for example the tracer itself is the only thing that knows when a span id has ever hit it and tracking that externally ironically might mean adding more api change to make it possible (or reliance on tools like bytecode weaving)

On Mon, Jan 28, 2019, 7:37 PM Sergey Kanzhelev <notifications@github.com wrote:

Yeah, I meant "fullness", not "cleaness". Sorry. I looked at brave PR @adriancole https://github.com/adriancole posted above. My comment was that with any new concept there will be people who want to have better control over when local root being set and reset. One example may be if incoming request is tracked by automatic module, but one want to signal that the "real" local root is the span that is tracked from customer code and which has all the information representing local root like span name and attributes. Another scenario is a trace that initiate another trace as a background job. Given those may be just theoretical - @adriancole https://github.com/adriancole I wonder if anybody asked for this after this concept was introduced?

Another two questions for the new concept I was trying to raise with my questions:

  1. Is it universal and needed for every (or most) open census user?
  2. Is it generic enough and whether the very next request will be to expose other local root span properties?

First is enough to have it. Second is nice to think about in advance.

I'm all for the scenarios enabled via local root. And I understand desire to make this feature to be pre-installed. My first reaction on this is that configurable extension approach would be a better option here. What are your thoughts on whether it's universal and generic.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/census-instrumentation/opencensus-specs/issues/229#issuecomment-458249748, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61_gdAlu9DG0_MHDG5h0DDGHqbrt2ks5vH0NjgaJpZM4aKA9h .

SergeyKanzhelev commented 5 years ago

I would offer that the local root functionality is technically difficult if possible to do as a bolt on.

Is it what Skywalking requested here? I might be missing bigger context here

axw commented 5 years ago

Another two questions for the new concept I was trying to raise with my questions:

Is it universal and needed for every (or most) open census user?

As Adrian says above, the target is exporters and not (directly) end-users. For anyone wanting to export OpenCensus spans to Elastic APM, this is necessary.

I can't speak for any other solution, though it's clearly not universal as there are other exporters without this need. Every solution is a little different of course, taking advantage of their particular constraints.

Is it generic enough and whether the very next request will be to expose other local root span properties?

I do not anticipate such requests coming from our end.

bogdandrutu commented 5 years ago

Hi @axw and everyone.

First: Sorry that I am late in this discussion.

Second: The current design of the library assumes that SpanContext is the "propagated" between processes (client to server RPCs), which I don't think it is the case for this. I am trying to think to not add any other non-propagated property in the SpanContext, what if add this to the Span?

PS: this is a bit of brainstorming so ideas are welcome.

axw commented 5 years ago

@bogdandrutu no worries.

The current design of the library assumes that SpanContext is the "propagated" between processes (client to server RPCs), which I don't think it is the case for this.

Makes sense, and you're right that this does not need to be propagated inter-process.

I am trying to think to not add any other non-propagated property in the SpanContext, what if add this to the Span?

Yes, that would work: record it on Span, and copy across to SpanData for the exporter. I've updated the opencensus-go PR accordingly.

codefromthecrypt commented 5 years ago

To make a design that does this in the Span, assumes the context in process is tightly coupled with the span anyway (a single object w/1-1 relationship). It also assumes things that want to snoop on context properties such as trace IDs are doing that via the span (ex logging integration). If some of these aren't the case, I'd recommend putting this with the rest of the properties in the same place so as to make integrations coherent.

Also, there's the immutability factor. local root needn't be mutable.

axw commented 5 years ago

@bogdandrutu https://github.com/census-instrumentation/opencensus-go/pull/1029 was merged; not my intention, but that's fine by me if you're happy with it. How should we proceed?

tsloughter commented 5 years ago

Rather than extending the span data can't this be an attribute?

axw commented 5 years ago

Rather than extending the span data can't this be an attribute?

Attributes aren't mentioned at https://opencensus.io/tracing/span/, but based on what I've gleaned from code, I understand that attributes are meant to describe properties of the operation that a span represents, and should not be required for describing causality or anything else crucial to the tracing system.

At least in the OpenCensus Go implementation, attributes are capped and can be dropped: https://github.com/census-instrumentation/opencensus-go/blob/57c09932883846047fd542903575671cb6b75070/trace/trace.go#L47-L49

tsloughter commented 5 years ago

Yea, true, it is meant to to drop when a limit is reached. If it isn't a "good to have" to improve experience then I guess that isn't an option.

tsloughter commented 5 years ago

Is this information used during the trace? Or only on the elastic backend when bringing everything together?

Spans can already be set that their parent is remote, not in the same process, which would signal that the span id for that span is the transaction id. It just wouldn't be available if needed when other children are creating/handling spans.

felixbarny commented 5 years ago

The API spec of the Elastic APM Server requires the transaction_id of the span: https://github.com/elastic/apm-server/blob/c9bf185f68ccb318d93897208a4611bc83d95afc/docs/spec/spans/span.json#L14

The UI needs that information in order to be able to see all spans of a transaction. This view is especially interesting to service owners which can use it to dig into why a particular transaction was slow, without the noise of the rest of the trace. This view also allows to "zoom out" and see the full trace.

Another thing where the entry span can be useful is to collect skeleton traces (Adrian already elaborated about this use case). Basically, you'd only send up the entry spans aka transactions. In order for the parent ids to make sense, the client spans don't propagate their ids but rather their transaction ids.

This request is not about adding support for skeleton traces, which may be a follow up for this. This is to enable a basic Elastic APM exporter. As mentioned, the APM Server requires the transaction_id. As currently there's no way to get the transaction id aka span id of the entry span, we're unable to integrate with OpenCensus.

tsloughter commented 5 years ago

Yea, I wasn't suggesting it would fit the api today but what I thought was an alternative way for the Elastic APM Server to put this information together, instead of extending the opencensus span definition.

axw commented 5 years ago

After some more investigation, we've come up with an alternative that unblocks creating an exporter for Elastic APM: make the transaction ID optional for spans. This may mean loss of future functionality, but the initial impact will be minimal.

So this can be downgraded to "nice to have" for us.

codefromthecrypt commented 5 years ago

by the way, this is actively desired by zipkin users. I'm not sure why people are theoretical about this. It exists in multiple APMs systems and there seems to be a disbelief in the value. anyway I've seen you've reverted this, so we can add it to the "no idea why census' doesn't get it list" along with messaging spans