Open n-e opened 2 years ago
We're currently planning the implementation for what's been discussed in #286: #600
But I'm not sure that would cover the scenario you're describing. Are you saying that even when sending a request with a traceparent
header to the Google Cloud Run runtime (B), it will create a new trace id when it calls service C?
This seems like a bug in Google Cloud Run. It should definitely not be overwriting an incoming traceparent. GCR should be name-spacing in this case, not using the W3C traceparent.
Here’s an example below with a (publicly accessible for now) cloud run service that echoes back the http request it has received.
The traceparent header received by the service has its parentId (and flags) changed.
curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'
Request served by localhost
HTTP/1.1 GET /
Host: test-nico-puofkkaada-ew.a.run.app
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*
Traceparent: 00-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a94c32b3846c1e53-01
X-Cloud-Trace-Context: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12199181237242043987;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
At least the trace id is preserved, which is something.
What would help is if our UI would support showing traces that are incomplete. Meaning that if there are services A -> B -> C, and if only the spans from services A and C are in Elasticsearch, it should show these, even if C's parent (B) is not stored in Elasticsearch.
We have discussed these situations in the past already:
However, I wonder if this changes things and whether we should look at ways to make this work.
@sqren do you have thoughts on that?
However, the fact that the sampling flag is overridden by Google Cloud Run is problematic, too.
Other agents have config options that make the agent use a elastic-apm-traceparent
header, instead of the standard traceparent
header. So (re-)introducing that for the Node.js agent could also be an option.
@elastic/apm-agent-node-js WDYT?
But I agree with @basepi that this seems like a weird behavior from Google Cloud Run at best.
What would help is if our UI would support showing traces that are incomplete.
I think we definitely should look at APM UI being able to show traces that (a) have missing spans and/or (b) are missing the root transaction. I would expect this to be a relatively common situation: A customer is migrating a system to Elastic Observability but can't monitor all services. Or a customer is using some load-balancer or proxy that supports W3C trace-context (so it changes the parentId), but Elastic doesn't support getting its trace data. Or a customer has one of their services written in a language for which Elastic doesn't have an APM agent, etc.
Other agents have config options that make the agent use a
elastic-apm-traceparent
header, instead of the standardtraceparent
header. So (re-)introducing that for the Node.js agent could also be an option.
@felixbarny Isn't that option about adding elastic-apm-traceparent
in addition to traceparent
for outgoing requests? My understanding is that even with use_elastic_traceparent_header=true
our APM agent will still prefer an incoming traceparent
header if it exists. That's what the Node.js APM agent currently does.
The Node.js APM agent currently does have useElasticTraceparentHeader
, and it defaults to true. It just isn't currently documented (not sure why).
this seems like a weird behavior from Google Cloud Run at best.
Is it weird that Cloud Run changes the parent-id? Perhaps, from its point of view, it is adding a span to the trace (the span being its Cloud Run dispatcher/load-balancer/whatever-they-call-it service.
@n-e Are you able to look at any of these traces in Google Cloud's trace viewer? https://cloud.google.com/trace/docs/viewing-details I'm curious if they show an added span for the Cloud Run service that routes the incoming HTTP request.
I agree that it is problematic that Cloud Run ignores the incoming "sampled" flag. (They document that here: https://cloud.google.com/run/docs/trace#trace_sampling_rate) However, they are still compliant with the spec, which says "The following are a set of suggestions that vendors SHOULD use to increase vendor interoperability. ..." (https://www.w3.org/TR/trace-context/#sampled-flag).
@sqren do you have thoughts on that?
I agree we could do a better job of showing orphaned services. Afaict we won't be able to connect the orphaned service to the tree. WDYT about placing it at the very end of the trace labelling it "orphan" or similar?
@n-e Are you able to look at any of these traces in Google Cloud's trace viewer?
Query:
curl https://test-nico-puofkkaada-ew.a.run.app -H 'Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb-00'
Request served by localhost
HTTP/1.1 GET /
Host: test-nico-puofkkaada-ew.a.run.app
Traceparent: 00-ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa-a7b54b18ba85b13c-01
X-Cloud-Trace-Context: ababaaaaaaaaaaaaaaaaaaaaaaaaaaaa/12084647744699216188;o=1
X-Forwarded-For: REDACTED
X-Forwarded-Proto: https
Forwarded: for="REDACTED";proto=https
User-Agent: curl/7.64.1
Accept: */*
Cloud Trace (there are the matching traceId and parentTraceId on the right):
In the W3C Trace Context they use the example of adding the latest vendor specific parent id to the tracestate
to allow the vendor to pick up the latest parent id recorded even if the parent id in the traceparent
header has been modified by an intermediate system. Wouldn't implementing support for that in the apm agents solve the issue mentioned without the use of the elastic-apm-traceparent
header?
I ran into this as well. would it make any sense to have the receiving APM client prefer elastic-apm-traceparent
to traceparent
if it is present? With Cloud Run and Elastic APM instrumenting a node service, it is difficult to strip the modified traceparent
header pre-Elastic APM agent, and if its there, elastic-apm-traceparent
is ignored...
In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.
Example: A -> B -> C (Service A calls Service B which calls Service C). A and C are monitored by Elastic APM, B is monitored by another service.
Elastic APM is unable to link the transactions in A and C, since C's parent is a transaction in B (since the third-party service in B writes a
traceparent
header), which Elastic APM doesn't know about since B is monitored by a third-party service.If we control B, the problem is easy to fix:
traceparent
headerHowever, Google Cloud Run does its own monitoring, which doesn't look like it can be disabled (in this case A is a downstream service, B is the Cloud Run runtime, and C is the service hosted in Cloud Run).
As a workaround, I have forked the elastic-apm-node agent and made it use the
elastic-apm-traceparent
header by default instead of thetraceparent
header. However I have no idea what a good long-term solution would be. Any ideas?Related: https://github.com/elastic/apm/issues/286