aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)
https://aws-otel.github.io/
Other
574 stars 239 forks source link

Make ALBs first-class processes in OTel traces #738

Open spencerwilson opened 2 years ago

spencerwilson commented 2 years ago

Is your feature request related to a problem? Please describe. When adding OpenTelemetry instrumentation to systems involving HTTP servers, I'd like request traces to contain a span for each process or host the request passes through. It's currently hard to obtain spans for time spent in AWS Application Load Balancers.

Throughout this issue, consider the following system:

// "A -> B" means process A sends an HTTP request to process B

client -> ALB1 -> app 1 -> ALB2 -> app 2

As of November 2021, ALBs are unaware of OpenTelemetry. They're compliant with W3C Trace Context spec in that they pass through traceparent and tracestate unmodified, if they receive them.[^1]

[^1]: I'm actually unsure if header pass-through is sufficient to make a process literally "compliant" with the spec; I've admittedly not run a fine-toothed comb over the spec.

Naturally, the support for X-Ray's tracing header, X-Amzn-Trace-Id, is a bit greater: they add or modify this header.

Helpfully, a value for X-Amzn-Trace-Id is also present in ALB access logs (doc)—although it's unclear if the logged value is the value of the header received by the ALB or sent in the request to the upstream target.

Describe the solution you'd like

I'd love it if ALBs could act like other processes in my system (which is trying to follow recommendations for greenfield OTel systems).

First, make ALBs support the following two configurations:

The definition for the latter can be found under "allowed mutations [to traceparent]" defined in the W3C Trace Context spec's section "3.4 Mutating the traceparent Field". Notably, this mutation effectively trusts the incoming trace context, including sampling decision—trust which can create a DoS vulnerability. As a future enhancement, supporting the "Restart trace" mutation defined in the spec might be a more secure alternative to "Update parent-id". I don't have an opinion on how an ALB should set the sampled flag during "Restart trace" mutation.

Second, do one of the following:

Describe alternatives you've considered

Additional context It might be interesting to hear the rationale behind AWS X-Ray's decision to not create X-Ray segments for hops through ALBs:

Load balancers do not send data to X-Ray, and do not appear as a node on your service map.

As a service operator I'm keen on knowing as much as a can about the entire request path.

sethAmazon commented 2 years ago

You may want to ask this question in the instrumentation repos. I do not think this is related to the collector @Aneurysm9 @anuraaga

spencerwilson commented 2 years ago

Thanks for the reply, @sethAmazon ! As I mentioned in the "alternatives you've considered" section, I don't see a way that any amount of open source or third party code could make it possible to achieve the goal stated at the top:

I'd like request traces to contain a span for each process or host the request passes through.

I'd be happy to be proven wrong, though! But if that's true then I'm not sure instrumentations are relevant here.

Rather, my understanding is that the ALB product with its current feature set precludes achieving the stated goal. In that sense this issue is primarily a feature request for the ALB product. aws-otel-collector is relevant only secondarily: The UX here could be really delightful if ALB had certain new features && there was a Collector receiver that consumed those features.

Curious to hear your thoughts, and please let me know if there might be better places to raise this.

willarmiros commented 2 years ago

Hi @spencerwilson,

Thank you for this write up, hopefully I can shed a bit more light on the current tracing integrations with ALB and how best to raise this feature request.

It might be interesting to hear the rationale behind AWS X-Ray's decision to not create X-Ray segments for hops through ALBs

At the time of X-Ray's initial integration with ALB to support the X-Amzn-Trace-Id header, the ALB team did not want to emit a segment to the X-Ray backend because ALB is first and foremost an ultra-low-latency load balancer, and they were concerned with the overhead of creating and emitting spans. However I think it still could be feasible as an opt-in feature.

The best way to raise this request to ALB's attention is by submitting a ticket with AWS Support for the ALB service with this feature request. Customer demand is a big driver of new features, and both the enhanced W3C trace header support and active tracing of ALB (e.g. generating span data instead of only mutating the trace header) are both on our radar, so this demand helps us get them prioritized.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

willarmiros commented 2 years ago

@alolita can we have stalebot ignore issues like this feature request? It will still be relevant.

Aneurysm9 commented 2 years ago

@willarmiros Would https://github.com/aws-observability/aws-otel-collector/pull/1054 and a Future Features milestone work?

willarmiros commented 2 years ago

@Aneurysm9 yep! Thanks

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been marked as stale for 30 days with no activity.

willarmiros commented 2 years ago

@Aneurysm9 can we keep We should keep this open til we've figured out a way to deal with ALB trace headers

raimi commented 1 year ago

I'd like to add the perspective of someone running a system like described above, where "app 1" and "app 2" are under my control, while "ALB1" and "ALB2" are provided as-is by AWS:

// "A -> B" means process A sends an HTTP request to process B
client -> ALB1 -> app 1 -> ALB2 -> app 2

(I have a log- and trace-collecting tooling in place, that reads all ALB access logs and all logs of all services/apps.)

net result: ALB access logs of ALB 1 are not correlated to the rest of the system (which works).

Please: Please do not wait for AWS to change the behaviour of the ALBs; this will not happen. Please: Just accept a minimal X-Amzn-Trace-Id header and let it start a new trace with the given trace-id. Start a new span, whatever. But keep the value so that the rest of us have a chance of finding their logs :)

Thanks!

P.S.: This has nothing to do with Amazon X-Ray directly. It's just a question of using the de-facto standard X-Amzn-Trace-Id.