erlef / observability-wg

Project for tracking the work of the Observability Working Group
Creative Commons Attribution Share Alike 4.0 International
61 stars 9 forks source link

Investigate use of seq_trace label for span context #2

Open tsloughter opened 5 years ago

tsloughter commented 5 years ago

Add Vince to assignees when he accepts his invite to the repo.

From the meeting minutes:

zachdaniel commented 5 years ago

Vince made this and linked to it in the minutes google doc, but I'm posting it here so everyone can see it:

https://github.com/binaryseed/seq_trace_intro

binaryseed commented 5 years ago

Before we can determine if it's usable, we need to clarify what it'd be used for...

Are we suggesting adopting this in individual projects that do tracing like Open Census , or creating some kind of light-weight telemetry_tracing like project that would be adopted by various projects?

tsloughter commented 5 years ago

I do wish we had a shared "context" library. I threw this together for opencensus and grpcbox https://github.com/tsloughter/ctx

But that is bigger than span context, it is what we keep span context in when not using the pdict. Even though it would not hurt to have tags and deadlines propagate with messages, I wouldn't want to shove that in the seqtrace label.

Back on topic.. we probably don't need to create a project for just this.

tsloughter commented 5 years ago

I don't expect to see an issue but we probably want to do some benchmarks on use of process dict, seq_trace and regular map with span context inside. Including to make sure it doesn't hurt message passing performance.

zachdaniel commented 5 years ago

That’s a good point. If we went this route, we’d be forcing all messages sent from a traced process to include additional data, as opposed to having to “manually” propagate context.

tsloughter commented 5 years ago

I think that is fine when it is just a few integers. But it should be benched.

tsloughter commented 5 years ago

I just opened this on opencensus but it is partly related to this issue and measuring performance of different context storage methods https://github.com/census-instrumentation/opencensus-erlang/issues/155

zachdaniel commented 5 years ago

I suspect you are right, but it would be a new side effect that we would probably want to make very clear. I think your point in our meeting was a good one that this sort of "hijacks" a utility that can only have a single user at any one time. I also agree with @binaryseed that that is probably not motivation enough to not use it (especially if it can be disabled via configuration). But if someone has some tight system level constraints, adding even small overhead to message passing could theoretically be problematic. Or if someone has a multi-network cluster, adding extra data to messages could physically cost them money.

binaryseed commented 5 years ago

I think doing some benchmarks is a great idea. Based on my experience, I suspect that leveraging the VM to pass context will be plenty fast, likely faster than manually passing full context maps around in application code.

The pattern I'd suspect will be ideal is that the seq_trace label contains minimal information, basically just a trace identifier (and anything else needed very frequently), and we use an ETS table keyed off that ID to store any other trace data. Wether we do a ETS per trace or a single global ETS or a partitioned set could be informed by a benchmark too.

So the cases we'd want to compare, roughly:

@tsloughter How does the process dictionary work when crossing process boundaries? I assume there is some manual instrumentation to propagate the context?

tsloughter commented 5 years ago

@binaryseed for the pdict yea, you'll need something that is just like X = get(span_ctx) and then it acts the same as with manual context passing as a variable, but with the additional step of on the other side having to do put(span_ctx, X).

binaryseed commented 5 years ago

I did some benchmarking documented here: https://github.com/binaryseed/seq_trace_intro/issues/1

There were pretty minimal differences in overhead (on the order of 0-2 us on my very old macbook air)

tsloughter commented 5 years ago

@binaryseed nice. I'll work on adding to that benchmarking to cover cases that are just regular context processing (as in, without message passing).

tsloughter commented 5 years ago

I've taken a first go at implementing this for OpenTelemetry https://github.com/open-telemetry/opentelemetry-erlang/pull/9/commits/acd68115019192383d40a3a752ac99b9f9cfa51a