envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.67k stars 4.75k forks source link

RFC: Update the tracing docs to better explain options available to service developers #1711

Closed objectiser closed 6 years ago

objectiser commented 6 years ago

When using the Zipkin tracer, the documentation currently indicates that five (B3) header properties need to be manually propagated by the application to enable end to end tracing to work.

In actual fact, regardless of whether using Zipkin or LightStep, the only header property that needs to be propagated is x-ot-span-context.

This doesn't prevent the application from using a tracer compatible with the B3 trace context format. The B3 headers are still passed into the application - and if the application makes use of those headers to (for example) create additional child spans, the updated B3 headers can be injected into any outbound request and will be used by the Egress proxy (once #1702 is merged).

Therefore I would propose that the documentation is updated to make this clear.

However it has also been suggested that the use of x-ot-span-context header, when Zipkin is the configured tracer, can now be removed - as the original reason for its use no longer applies.

It would be good to have some end user opinions on which direction is preferred:

  1. Keep use of the single x-ot-span-context, and therefore application only has to manually propagate a single header for trace context

  2. Remove use of x-ot-span-context when zipkin tracer is configured, and the application needs to propagate all 5 of the B3 header properties.

objectiser commented 6 years ago

cc @fabolive @adriancole

mattklein123 commented 6 years ago

cc @rshriram @bhs @goaway @RomanDzhabarov

In general, I would love to do a pass on the tracing docs and making them more clear across the features supported for every tracer, and then have specific docs for the currently supported ones (Zipkin and LS). Anything we can do here to simplify/clarify sounds great to me.

codefromthecrypt commented 6 years ago

x-ot-span-context is not a standard (defacto or otherwise) header. I would suggest against implying we should proliferate use of this. For example, some use B3 because appliances already permit the headers (or other systems use it). That's certainly not the case with x-ot-span-context

mattklein123 commented 6 years ago

From my perspective I like an option that only requires a single header to be propagated. Personally I would just support both and clearly document what is possible. If people in general want to remove x-ot-span-context from Zipkin tracer that's also fine with me.

codefromthecrypt commented 6 years ago

I think a single header that would be used by multiple vendors is yet to come albeit in progress (https://github.com/TraceContext/tracecontext-spec) unless i am mistaken x-ot-context is a lightstep thing. I dont know why it would be relevant for zipkin or rather B3 compatible systems (of which arent all zipkin) except for historical reasons in envoy.

mattklein123 commented 6 years ago

i don't really care what the single header is called. The fact that no one has been able to pick one is all the more reason to just make it happen in Envoy and push it forward. So my vote here continues to be to support both options and clearly document. To me there is no downside of this approach.

bhs commented 6 years ago

In the immediate term, I agree with @mattklein123's [pragmatic] reasoning. I also agree with @adriancole in that – in the long term – we should move beyond x-ot-span-context towards something that's both "standard" (sic?) and general-purpose. https://github.com/opentracing/specification/issues/86 represents a few thoughts about this... my favorite personal outcome would be the header name (Trace-Context) from https://github.com/TraceContext/tracecontext-spec and a header value that's similar but slightly more general (variable length ids, ideally basic support for baggage).

codefromthecrypt commented 6 years ago

i don't really care what the single header is called. The fact that no one has been able to pick one is all the more reason to just make it happen in Envoy and push it forward. So my vote here continues to be to support both options and clearly document. To me there is no downside of this approach.

I feel the frustration, but yeah promoting "x-ot-span-context" because no-one can "pick one" is disappointing. I hope once you are at a non-frustrated moment, you could see downsides of amplifying this. If you did want a "pick one" seems someone at lyft could participate in trace-context simlarly to how jeff pinner did in http/2 back in the day. You do have some means to make the reality you desire happen in a responsible way, congruent with the power and influence you have as the controller of well.. a lot of people's controllers.

mattklein123 commented 6 years ago

I hope once you are at a non-frustrated moment, you could see downsides of amplifying this. If you did want a "pick one" seems someone at lyft could participate in trace-context simlarly to how jeff pinner did in http/2 back in the day.

@adriancole I'm sorry that you think that me pushing this forward is irresponsible and solely out of frustration. That's really not the case. I'm also more than happy to participate in working groups on this subject, with the caveat being that I would like to make meaningful progress which I have not seen happen so far (so yes you are right I am a little frustrated!).

In this particular case, I don't really see the downside of leaving working code in Envoy Zipkin that can do propagation with a single header. We all agree that we should make the B3 headers work perfectly. If someone prefers to propagate a single header they can do that also and we can document/warn in the Envoy docs that it is non-standard.

Fundamentally, I am a pragmatic person, and I want to make progress. You are correct that Envoy is a bit of a hammer in this regard, but sometimes that's what it takes to get things unclogged and moving. Per @bhs ultimately I would prefer that we standardize on some single header and Trace-Context seems fine to me. So if this issue and/or Envoy adds a bit of extra urgency to making progress on that front so be it. Let me know how I can help and let's make something happen that we all agree on.

louiscryan commented 6 years ago

I'm pretty much with Matt on this and sympathize with his frustration around progress on alignment. He's frustrated (and me too) not out of some desire for architectural purity but because Envoy is expected to carry a significant implementation and maintenance burden to normalize these issues for our users.

Envoy is a big hammer for projects like Istio and I want to make sure we use it wisely. Using Envoy as the means to promote standardization seems sensible to me. i.e Envoy transcribing between the commonly used trace formats as needed by the applications to which it is attached as a sidecar / reverse proxy for inbound traffic while rigorously using a single format for interchange between Envoy instances seems like a reasonable first step. If at a later point application frameworks pick up support for that 'one' format then the performance of the mesh improves.

This puts a pretty significant maintenance responsibility on Envoy to support a reasonable set of transcribable target formats and it also puts a configuration burden on operators while there is tracing implementation diversity within services in a mesh. I think it's also reasonable for Envoy developers to carry the burden for building/testing for some of the more widely used formats but not all of them.

In that vein, what do we think is reasonable to ask Matt and co. to support? Zipkin, OT, TraceContext, Jaeger, ... Thats already a pretty sizeable list and we haven't gotten to the commercial vendors

codefromthecrypt commented 6 years ago

Your maintenance burden so far stems from initially making envoy out of the box lightstep only. Later zipkin was added which is 2, the latter being supported by far more except that there were bugs that were not prioritized to fix until recently. I didnt realize that the two formats are overwhelming your ability to staff the project.. I get it I am one person and am overwhelmed often.

I have invited both of you often to join and help get a trace context format together..That you want this done without participating is fine, if you want it to happen faster.. that's something you can affect.

There are several active discussions which are progressing and a pilot of the trace-context format is being piloted in grpc since v1.4 I think. Just last week we discussed this in a spring offsite, and this weekend a key piece to make it work in brave (zipkin) landed. Bear in mind.. I am only one person, yet took time to engage folks you arent about this common goal. For example, I spoke to dynatrace and others who are meeting in Nov about this. Some are using B3 tactically until trace-context is standards track. All this will reduce your load later.. it motivates me to help you.

The thing is though.. if you want really can move faster if you participate. Calling it quits while volunteers like me are working hard on this is demotivating.

codefromthecrypt commented 6 years ago

concretely, here's my advice:

For help, if you can watch or comment on things you care about on the spec. https://github.com/TraceContext/tracecontext-spec

If you can only spend a little time, carve out some remote time for this at the next workshop, which is where trace-context will be a big topic. Last time I asked no-one representing service mesh showed up, although Kevin from Linkerd said he will in Nov https://docs.google.com/document/d/19SQOio1z7mHqb79MzLQCiKdkq5tdirOW8OCcTQNQdUg/edit#

On the spec, there are important pieces for interop to actually work, I'd encourage folks who can feedback to be able to try things. For example, a trace surviving going through envoy or istio. Sometimes folks are tempted to say ... just accept anything, which doesn't guarantee in any way a trace will survive. Sometimes folks assume that "one header" will work for all things.. we know that there are separate needs for retry budgets etc. I'm hoping we can have honest discussions as opposed to rushed ones. We all want this to work and not be another debt item speckled everywhere.. good tests to the degree you can help facilitate are crucial for that.

bhs commented 6 years ago

It's odd that we're getting into the weeds of retry budgets/etc when we lack even a standard header name or a standard format that supports ids of different lengths (since different Tracers presently use ids of different lengths).

I would argue for something like https://github.com/TraceContext/tracecontext-spec/pull/19, followed by support within Envoy for same, followed by work to standardize the way that the actual Span data (not just the context data, I mean) gets out of the Envoy process and into a tracing system.

Once that's done, we can pursue a version of the context work that incorporates the more exotic features (special support for retries, fancy sampling, etc). Working through the exotic stuff now is going to delay some straightforward improvements and generalizations in Envoy (and elsewhere).

codefromthecrypt commented 6 years ago

It's odd that we're getting into the weeds of retry budgets/etc when we lack even a standard header name or a standard format that supports ids of different lengths (since different Tracers presently use ids of different lengths).

The point is that there has been cries for a single header to carry all propagated context. I think that is an invalid expectation. Not trying to thrash on this topic.

mattklein123 commented 6 years ago

I think that is an invalid expectation

I do not think it's an invalid expectation. To me it's pretty straightforward. I agree with @bhs that it's time to start making progress here and that "the perfect is the enemy of the good."

From my perspective as long as the header format is versioned in some way we can start with something and iterate later. (At minimum versioning could be some version bits. It could also be using something like proto to define the header contents which is inherently expandable in the future).

I'm committed to making progress on this subject as IMO it's one of the largest pain points of users of a transparent service mesh.

I would argue for something like TraceContext/tracecontext-spec#19, followed by support within Envoy for same, followed by work to standardize the way that the actual Span data (not just the context data, I mean) gets out of the Envoy process and into a tracing system.

+1. We will continue to work with @bhs and the OT folks to find a path forward here. We would of course love to work with anyone else and am happy to have in person conversations if necessary. (Quite honestly I have never been asked to attend a tracing meeting).

As an aside, I would suggest that we table this issue for now as we are getting off topic into more foundational conversations that are probably better held elsewhere.

I think the resolution of this specific issue is: B3 is now fully supported for Envoy Zipkin and we are committed to keeping that working. We won't remove the x-ot-span-context code as it is a working example of how to do single header propagation. We also won't push users towards it. We welcome any/all clarifying doc updates about the current state of the world. Fair?

bhs commented 6 years ago

@mattklein123

Fair?

Sounds good to me. And I agree that the actual issue here is a lot simpler than the more foundational things it's bringing up.

codefromthecrypt commented 6 years ago

On 25 Sep 2017 01:30, "Matt Klein" notifications@github.com wrote:

I think that is an invalid expectation

I do not think it's an invalid expectation. To me it's pretty straightforward.

If one header to do all things, retries duration, deadlines, feature flags, baggage etc is expectable and the industry will use this such that you have one and only one header to ever propagate.. I am on board! Can you find another person besides ben that thinks we can basically stop using all headers except one from now on? Realistically speaking? Since this is easy for you, convincing others should be too, as their jobs will be easy. We would have no http/2 compression anymore, but at least you will have one header.

mattklein123 commented 6 years ago

If one header to do all things, retries duration, deadlines, feature flags, baggage etc

As @bhs said I think it's premature to start talking about this (though ultimately useful). Trace propagation is the biggest pain point that I see now and we can solve that first. As long as the format is extensible we can add more things later (maybe).

Can you find another person besides ben that thinks we can basically stop using all headers except one from now on?

Lots of people would like to see this happen. It will not happen overnight. If we never start making it happen, it will never be so. As @louiscryan said, one of the benefits of something like Envoy is that we can transcode between different formats to help users migrate over time.

We would have no http/2 compression anymore

h2 static compression is dubiously useful for the data I am most concerned about, which will change for every request.

codefromthecrypt commented 6 years ago

If one header to do all things, retries duration, deadlines, feature flags, baggage etc

As @bhs https://github.com/bhs said I think it's premature to start talking about this (though ultimately useful). Trace propagation is the biggest pain point that I see now and we can solve that first. As long as the format is extensible we can add more things later (maybe).

This is the point of contention, though. The effort was already scoped to handle just trace propagation (and not try to include arbitrary fields as a part of the same header). The expectation that one header will obviate any need for other request-scoped fields is what I said was unrealistic, which seems we don't actually disagree with!

Can you find another person besides ben that thinks we can basically stop using all headers except one from now on?

Lot's of people would like to see this happen. It will not happen overnight. If we never start making it happen, it will never be so. As @louiscryan https://github.com/louiscryan said, one of the benefits of something like Envoy is that we can transcode between different formats to help users migrate over time.

If we are talking about a trace context header, then clearly there are many who think this can happen. That's why we started this effort and have a header. transcoding this is easier if you don't require this to also be an arbitrary baggage system.

We would have no http/2 compression anymore

h2 static compression is dubiously useful for the data I am most concerned about, which will change for every request.

When we take arbitrary fields out of scope, then yes (except maybe a potentially 32char trace ID which does repeat on successive requests eg fan-out within a connection)

When we start scoping in arbitrary baggage, like formally starting to tunnel fields into a header value as has been hinted before, then I think you will agree it will interfere with hpack. These fields are currently tossed around as string names, not index numbers. Things like foo-meta-X can index, but if X is embedded in a value, not even the field name can be indexed. Things like this is why I think it is important to not link together the requirements to propagate trace context with arbitrary fields. If we do, then we have to vocalize the tradeoffs we know when we start walking down that path, one of them is hpack, but there are more.

TL;DR; let's not write it off as easy, tunneling arbitrary data portably through a trace context header in a way others can use it. This sort of thing can swamp the ship, and we know that. At the very least there are real costs in terms of coordination and complexity, perhaps literally bigger byte costs. 2p is we can be better off having such an idea as a "maybe" like you said, vs an expectation.

codefromthecrypt commented 6 years ago

As an aside, I would suggest that we table this issue for now as we are getting off topic into more foundational conversations that are probably better held elsewhere.

agreed. it is better held elsewhere.

I think the resolution of this specific issue is: B3 is now fully supported for Envoy Zipkin and we are committed to keeping that working. We won't remove the x-ot-span-context code as it is a working example of how to do single header propagation. We also won't push users towards it. We welcome any/all clarifying doc updates about the current state of the world. Fair?

sure, sounds fair. Thanks for that

louiscryan commented 6 years ago

@adriancole @bhs

given the amount of background reading to come up to date on all the issues perhaps the best next step for me is to provide review feedback on the TraceContext spec ?

If there were some cliff notes for the outstanding points of disagreement that would be helpful too. Pointers welcome

codefromthecrypt commented 6 years ago

given the amount of background reading to come up to date on all the issues perhaps the best next step for me is to provide review feedback on the TraceContext spec ?

Sounds great

If there were some cliff notes for the outstanding points of disagreement that would be helpful too. Pointers welcome

Will send things your way.

mattklein123 commented 6 years ago

@adriancole can you send to me also. Thank you.