elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0
374 stars 111 forks source link

Traceparent header overriden by an intermediate service #609

Open n-e opened 2 years ago

n-e commented 2 years ago

In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.

Example: A -> B -> C (Service A calls Service B which calls Service C). A and C are monitored by Elastic APM, B is monitored by another service.

Elastic APM is unable to link the transactions in A and C, since C's parent is a transaction in B (since the third-party service in B writes a traceparent header), which Elastic APM doesn't know about since B is monitored by a third-party service.

If we control B, the problem is easy to fix:

However, Google Cloud Run does its own monitoring, which doesn't look like it can be disabled (in this case A is a downstream service, B is the Cloud Run runtime, and C is the service hosted in Cloud Run).

As a workaround, I have forked the elastic-apm-node agent and made it use the elastic-apm-traceparent header by default instead of the traceparent header. However I have no idea what a good long-term solution would be. Any ideas?

Related: https://github.com/elastic/apm/issues/286

felixbarny commented 2 years ago

We're currently planning the implementation for what's been discussed in #286: #600

But I'm not sure that would cover the scenario you're describing. Are you saying that even when sending a request with a traceparent header to the Google Cloud Run runtime (B), it will create a new trace id when it calls service C?

basepi commented 2 years ago

This seems like a bug in Google Cloud Run. It should definitely not be overwriting an incoming traceparent. GCR should be name-spacing in this case, not using the W3C traceparent.

n-e commented 2 years ago

Here’s an example below with a (publicly accessible for now) cloud run service that echoes back the http request it has received.

The traceparent header received by the service has its parentId (and flags) changed.

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'

Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*
Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a94c32b3846c1e53-01
X-Cloud-Trace-Context: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12199181237242043987;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
felixbarny commented 2 years ago

At least the trace id is preserved, which is something.

What would help is if our UI would support showing traces that are incomplete. Meaning that if there are services A -> B -> C, and if only the spans from services A and C are in Elasticsearch, it should show these, even if C's parent (B) is not stored in Elasticsearch.

We have discussed these situations in the past already:

However, I wonder if this changes things and whether we should look at ways to make this work.

@sqren do you have thoughts on that?


However, the fact that the sampling flag is overridden by Google Cloud Run is problematic, too.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@elastic/apm-agent-node-js WDYT?


But I agree with @basepi that this seems like a weird behavior from Google Cloud Run at best.

trentm commented 2 years ago

What would help is if our UI would support showing traces that are incomplete.

I think we definitely should look at APM UI being able to show traces that (a) have missing spans and/or (b) are missing the root transaction. I would expect this to be a relatively common situation: A customer is migrating a system to Elastic Observability but can't monitor all services. Or a customer is using some load-balancer or proxy that supports W3C trace-context (so it changes the parentId), but Elastic doesn't support getting its trace data. Or a customer has one of their services written in a language for which Elastic doesn't have an APM agent, etc.

Other agents have config options that make the agent use a elastic-apm-traceparent header, instead of the standard traceparent header. So (re-)introducing that for the Node.js agent could also be an option.

@felixbarny Isn't that option about adding elastic-apm-traceparent in addition to traceparent for outgoing requests? My understanding is that even with use_elastic_traceparent_header=true our APM agent will still prefer an incoming traceparent header if it exists. That's what the Node.js APM agent currently does.

The Node.js APM agent currently does have useElasticTraceparentHeader, and it defaults to true. It just isn't currently documented (not sure why).

this seems like a weird behavior from Google Cloud Run at best.

Is it weird that Cloud Run changes the parent-id? Perhaps, from its point of view, it is adding a span to the trace (the span being its Cloud Run dispatcher/load-balancer/whatever-they-call-it service.

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer? https://cloud.google.com/trace/docs/viewing-details I'm curious if they show an added span for the Cloud Run service that routes the incoming HTTP request.

I agree that it is problematic that Cloud Run ignores the incoming "sampled" flag. (They document that here: https://cloud.google.com/run/docs/trace#trace_sampling_rate) However, they are still compliant with the spec, which says "The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability. ..." (https://www.w3.org/TR/trace-context/#sampled-flag).

sorenlouv commented 2 years ago

@sqren do you have thoughts on that?

I agree we could do a better job of showing orphaned services. Afaict we won't be able to connect the orphaned service to the tree. WDYT about placing it at the very end of the trace labelling it "orphan" or similar?

Correct trace

Suggested view if an intermediary service is missing

n-e commented 2 years ago

@n-e Are you able to look at any of these traces in Google Cloud's trace viewer?

Query:

curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'
Request served by localhost

HTTP/1.1 GET /

Host: test-nico-puofkkaada-ew.a.run.app
Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a7b54b18ba85b13c-01
X-Cloud-Trace-Context: ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12084647744699216188;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*

Cloud Trace (there are the matching traceId and parentTraceId on the right):

image
knoring1 commented 1 year ago

In the W3C Trace Context they use the example of adding the latest vendor specific parent id to the tracestate to allow the vendor to pick up the latest parent id recorded even if the parent id in the traceparent header has been modified by an intermediate system. Wouldn't implementing support for that in the apm agents solve the issue mentioned without the use of the elastic-apm-traceparent header?

ty-elastic commented 9 months ago

I ran into this as well. would it make any sense to have the receiving APM client prefer elastic-apm-traceparent to traceparent if it is present? With Cloud Run and Elastic APM instrumenting a node service, it is difficult to strip the modified traceparent header pre-Elastic APM agent, and if its there, elastic-apm-traceparent is ignored...