Smeb opened 9 months ago
Hi, @Smeb.
The bug where Envoy's Datadog tracer does not set the "error" property on spans that are considered errors by Envoy was fixed in a recent set of PRs to Envoy (main, v1.27, v1.28). The fix has not yet found its way into Istio.
None of the Envoy/Istio Datadog bugs that customers have been encountering since v1.27 are due to dd-trace-cpp. Instead, they are bugs that I introduced to Envoy itself while rewriting Envoy's Datadog extension to use dd-trace-cpp (as opposed to the older OpenTracing-based dd-opentracing-cpp).
I don't think that it's appropriate to backport the latest release of dd-trace-cpp onto Envoy's release branches. Those release branches can continue to use whichever versions of dd-trace-cpp they use. The bugs, which are in Envoy's source tree, have all been backported already.
What's new to me is how your trace sampling went to 100%. The only sampling-related bug that I'm aware of in Envoy v1.27 could cause fewer traces to be sampled, but never more.
This sampling-related change that you've encountered is best debugged with your system's complete Istio configuration in hand (whatever that might mean).
You can open a Support ticket, which will likely get escalated to me, and/or we can investigate the sampling issue here on GitHub.
My initial goal will be to get an Istio configuration that allows me to reproduce the unexpected 100% sampling behavior. For this the following from you would be helpful: complete YAML files, helm charts, etc.
If any of this information is sensitive, then we can instead carry on in Datadog's Support system (linked above).
I can gather that information tomorrow; in the meantime, here are some relevant details, if they help.

The sampling decision is tagged `automatic`, not `configured`. From my understanding, if the rules were working, it would be tagged with `rule`.

With proxy debug logs enabled for tracing, this is the configuration:
```json
{
  "collector": {
    "config": {
      "event_scheduler": {
        "type": "Envoy::Extensions::Tracers::Datadog::EventScheduler"
      },
      "flush_interval_milliseconds": 2000,
      "http_client": {
        "type": "Envoy::Extensions::Tracers::Datadog::AgentHTTPClient"
      },
      "url": "http://xxx.xxx.xxx.xxx:8126/v0.4/traces"
    },
    "type": "datadog::tracing::DatadogAgent"
  },
  "defaults": {
    "environment": "sandbox",
    "name": "envoy.proxy",
    "service": "external-ingressgateway.external-ingressgateway",
    "service_type": "web",
    "version": "envoy d475eda09052cbf4ee25157f68b13abdb7b17c2f/1.27.3-dev/Clean/RELEASE/BoringSSL"
  },
  "environment_variables": {
    "DD_AGENT_HOST": "xxx.xxx.xxx.xxx",
    "DD_ENV": "sandbox",
    "DD_SERVICE": "external-ingressgateway.external-ingressgateway",
    "DD_TRACE_AGENT_PORT": "8126",
    "DD_TRACE_ENABLED": "true",
    "DD_TRACE_SAMPLING_RULES": "[{\"sample_rate\":0.1}]"
  },
  "extraction_styles": ["tracecontext", "Datadog"],
  "injection_styles": ["tracecontext", "Datadog"],
  "span_sampler": {
    "rules": []
  },
  "tags_header_size": 512,
  "trace_sampler": {
    "max_per_second": 200.0,
    "rules": [
      {"name": "*", "resource": "*", "sample_rate": 0.1, "service": "*"}
    ]
  },
  "version": "dd-trace-cpp version v0.1.8"
}
```
Some notes on the above: `sandbox` is a test cluster, and I'm using Envoy proxy nightlies to get the fixed version of the proxy code from your changes two weeks ago.

I also tried setting span sampling; unfortunately, that didn't help.
~~I also set the rate limit to 1 trace per second, which also did not stop the traces. I ran a load test at 100 qps and got 100 traces back per second. This, to me, points to the trace sampling not even being part of the decision making.~~
With the nightly the error issue appears to be fixed, which is great.
Tomorrow I will try to get this running/reproduced in kind, and then I can get you some charts.
Update: the behaviour when setting traces per second is a bit odd. The ingestion dashboard in Datadog reports 180 traces ingested per second (it should be 3). In APM, there are around 5,000 spans over 30 minutes, which is roughly what you would expect. In this case I suppose I should trust the ingestion dashboard (since I believe that's what we get billed for).

Update to the update: I expect the indexed-span count is just lagged. Now it shows 750,000 spans.
Thanks for the info and explanation, @Smeb.
I'll be unavailable until Monday next week. Here are some thoughts until then.
There are multiple layers of sampling configuration involved:

1. dd-trace-cpp has its own sampling configuration, set via environment variables (e.g. `DD_TRACE_SAMPLE_RATE`, `DD_TRACE_SAMPLING_RULES`).
2. Istio's configuration (an `IstioOperator`, maybe a helm chart's `values.yaml`, maybe an argument passed to `istioctl`...) has associated with it a sampling rate for the "pilot," and optionally for the egress and/or ingress "gateways." Istio uses this sampling rate to configure Envoy's sampling rate. Datadog's documentation recommends that Istio's sampling rate be set to 100%, and that dd-trace-cpp's sampling rate be set using an environment variable.
3. Your service might run as a `Deployment` that has Istio's sidecar injection enabled. In that case, there is a special annotation (`apm.datadoghq.com/env`) that can be set on the `Deployment` in order to specify Datadog-specific environment variables for the injected sidecar (Envoy). Based on the configuration JSON you shared from the log, this part seems to be working.

The question is: why are you now seeing 100% of traces in the Datadog UI, when previously you saw only 10%?
It might be helpful to pursue this in Datadog's Support system, as our Support representatives can see the things in the Datadog UI that you are seeing, and this would help to diagnose the issue.
For now, it would be helpful to have the complete configuration of Istio in your system. For example, are you using an `IstioOperator`? These traces for which you are configuring sampling: do they involve service sidecars, ingress gateways, or egress gateways (or some combination)? How are those Istio components configured?
Again, the goal is for me to have a reproduction of the issue on my test Kubernetes cluster.
Small follow-up: we're just working on getting a minimal configuration we can share. In parallel, we're reaching out to Support, so hopefully we'll have something we can send you soon. We can't share the whole current configuration here as-is, and I think it makes sense to find a minimal reproduction anyway, since that should help pinpoint the issue.
Bringing some news (but no configs). I believe this should help narrow down the problem, though.
I'm continuing to trim our config, but this seems to narrow it down. I'll try with a no-op lua filter and see if that shows the same behaviour.
Thanks for the update, @Smeb. I'm unfamiliar with Envoy's Lua extension and how it might interact with tracing and with Istio.
That will have to be part of my attempted reproduction once you have a configuration for me. I appreciate your putting forth the effort to produce a minimal example; that's very helpful.
@Smeb I think that you have indeed found the root cause. It's another difference in behavior in Envoy's Datadog tracing extension between Envoy <= 1.26 and Envoy >= 1.27.
In the old OpenTracing-based code, "overriding" the sampling decision looked like this:
`bool sampling_priority_locked` can be `true` in two scenarios:
In the two scenarios above, Lua setting the sampling decision would do nothing.
In the new dd-trace-cpp based code, overriding the sampling decision looks like this:
The new code performs an unconditional override of the sampling decision, even if one has been made previously.
In the two scenarios above, Lua setting the sampling decision could override the configured sampling behavior.
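The difference can be sketched in standalone C++. These structs and names (`OldStyleSpan`, `NewStyleSpan`) are invented for illustration; Envoy's real span classes are more involved, but the guard logic is the point:

```cpp
#include <optional>

// OldStyleSpan models the OpenTracing-era guard: once the sampling priority
// is locked, setSampled is a no-op.
struct OldStyleSpan {
  std::optional<bool> sampled;            // decision made by the sampler, if any
  bool sampling_priority_locked = false;  // true once the decision is final

  void setSampled(bool value) {
    if (sampling_priority_locked) {
      return;  // old behavior: a locked decision cannot be overridden
    }
    sampled = value;
  }
};

// NewStyleSpan models the dd-trace-cpp-era behavior: unconditional override.
struct NewStyleSpan {
  std::optional<bool> sampled;

  void setSampled(bool value) {
    sampled = value;  // new behavior: clobbers any prior decision
  }
};
```

With the old guard, a 10% rule's "drop" decision survives a later `setSampled(true)` from the Lua filter; with the new behavior, that same call forces the trace to be kept, which would explain sampling jumping to 100%.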
I'm not sure what is best to do about this. More information about your setup would still be helpful.
One thing that I could do is alter `setSampled` so that it ignores its argument whenever a sampling decision has already been made. I could then backport that change onto Envoy v1.27, v1.28, and v1.29, and you'd have to either wait for Istio to pick up the change or use a non-release version of Istio's Envoy-based proxy.
Or, perhaps the new behavior is preferable, and there's a way that you can prevent the Lua extension from modifying the sampling decision.
What do you think?
If this is the root cause, awesome!
re: our setup, I'm currently experimenting with changing the Envoy Lua example to also use Datadog tracing; this seemed easier than trying to trim down our config, which uses the Istio operator, sidecars, et al. Really we just want the Envoy bit.
In our configuration though, I did find the following:
This is the Istio patch logic:

```yaml
patch:
  operation: INSERT_BEFORE
  value:
    name: envoy.lua
    typed_config:
      "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua"
      inlineCode: |
        function on_healthz_call(cluster, request_handle)
          local req_headers = {
            [":authority"] = "sidecar",
            [":method"] = "GET",
            [":path"] = "/healthz/ready",
          }
          local resp_headers
          local _, err = pcall(
            function ()
              resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", 1000)
            end)
          if err then
            request_handle:respond({ [":status"] = "503" }, "")
            return -- avoid responding a second time below
          end
          request_handle:respond({ [":status"] = resp_headers[":status"] }, "")
        end
```
The `value` part of this should be possible to drop directly into a configuration.
Regarding the behaviour, we would prefer the old behaviour, since I don't see an obvious way to configure the tracing behaviour itself in these docs.
I haven't tested this yet, but perhaps this is what's happening:

Your `httpCall` in Lua does not specify a table argument (instead, you pass the timeout integer). So, the configuration passed to the underlying HTTP call in the C++ code comes from a default-constructed `RequestOptions`, whose relevant member is declared `absl::optional<bool> sampled_{true};`. It defaults to `true`, not to `nullopt`.

`sampled_` is then used to set the span's sampling decision, which triggers the rest of the call stacks that I described in my previous comment.

So, it looks like the documented contract for `Span::setSampled` disagrees with current usage. There is a mismatch between `RequestOptions::sampled_`'s default value of `true` (as opposed to `nullopt`) and `Span::setSampled`'s contract, which states:

> This method overrides any previous sampling decision associated with the trace instance. If the sampled parameter is false, this span and any subsequent child spans are not reported to the tracing system.

The question is still whether I should change Envoy's Datadog tracing extension back to the old OpenTracing behavior, or whether I should take this up with the maintainers of Envoy. Perhaps either `RequestOptions::sampled_` should default to `nullopt`, or the Lua filter extension should set the value to `nullopt` if no `trace_sampled` table option is passed to `httpCall` in Lua.
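To make the mismatch concrete, here is a standalone sketch. The names `CurrentRequestOptions`, `ProposedRequestOptions`, and `applySampling` are invented; they mirror the idea of the default value, not Envoy's actual code:

```cpp
#include <optional>

// "Current": the member is initialized to true, so a call that omits the
// options table still carries an explicit "force sampled" instruction.
struct CurrentRequestOptions {
  std::optional<bool> sampled{true};
};

// "Proposed": an empty optional means "no opinion; keep the prior decision."
struct ProposedRequestOptions {
  std::optional<bool> sampled;  // defaults to std::nullopt
};

// Applies the options to a prior sampling decision, honoring "no opinion."
template <typename Options>
bool applySampling(const Options& opts, bool prior_decision) {
  return opts.sampled.value_or(prior_decision);
}
```

With the current default, a prior "drop" decision is always overridden to "keep"; with a `nullopt` default, the prior decision would pass through untouched.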
Maybe it's best for me to submit a PR to update the contract of Span::setSampled, and see what the Envoy maintainers think.
In the meantime, there's the question of what to do about the sampling of your system. Even if you explicitly specify a sampling decision in Lua, there does not appear to be a way to say "use whatever decision has already been made."
If there were a way to retrieve the current sampling decision in Lua, then you could use that as a workaround until the Datadog/Envoy/sampling question is decided, but I don't see a way to do that.
Perhaps it's best to restore the old behavior in Envoy's Datadog tracing extension, and leave the larger question for another time.
Any ideas?
I just talked through this with my teammate, and we have an idea. Instead of changing the Datadog extension's `Span::setSampled` back to the old behavior, I can propose the following changes:

1. The `httpCall` Lua function currently coerces its sampling argument into a boolean. Instead, allow it to accept `nil` in addition. This is technically a breaking change, since `nil` is falsy in Lua; the change would make it function as a "no-op" in this context instead.
2. Pass the resulting `optional<bool>` to `RequestOptions::setSampled`, instead of passing a `bool`. The null value, which is still not the default, would mean "don't override the sampling decision."
3. In your Lua code, use the table form of `httpCall`, and explicitly mention `trace_sampled=nil`, e.g.

```lua
resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", {timeout_ms=1000, trace_sampled=nil})
```

I still would need to address the underlying contractual issue with the Envoy maintainers, but this is a workaround that you and others could use, and it minimizes any possible unintended side effects.
Sorry for the delay in replying - timezones.

My suggestion would be to restore the old behaviour, and then discuss with Envoy the possibility of updating the Lua filter's tracing behaviour for the future. I think the workaround comes with a few issues:

- It requires configuration by everyone who needs it, whereas most consumers would prefer safe defaults/least surprise. Ideally the "default" (non-overloaded) version should work as expected, and migrating between versions should require no changes to user configurations.
- It potentially impacts every trace library interacting with the Lua filter, which means it'll be more difficult to upstream. In the interest of solving the user problem quickly, I think it makes sense to localise changes to the Datadog part of the extension.
- There could be other places in the code where `setSampled` is called which aren't in the Lua filter; those would still be broken if they have similar behaviour.

That said, I do agree that the current usage is problematic and worth addressing in the long term. Also, thanks for the investigative work - all the code links really help clarify the source of the issue.
Yes, I think I'll restore the old behavior in Envoy's Datadog extension, and open a separate issue in Envoy about the intended behavior of `Span::setSampled`.
Hi - just to follow up, is there a rough ETA on reverting the behaviour? Happy to help to expedite this if there's any way we can support.
I'll throw some code at it today. Will let you know if I expect a delay after that.
Hi @Smeb,
Sorry for the delay. I'll take care of @dgoffredo's work. I'm on PTO now, but I'll make it my top priority when I'm back the week of April 8th.
Thank you for your patience.
Thanks for the follow up 👍
Hi - we upgraded to Istio 1.19.5, and thereby to dd-trace-cpp 0.1.8. We noticed the following changes:
This is for an Istio gateway service running Istio 1.19.5, Envoy 1.27, dd-trace-cpp 0.1.8.
We're trying to mitigate the second issue, but so far the environment variables don't seem to have an effect.
Originally, we had the following environment variable set:
This comes from here. With OpenTracing, this meant the agent was controlling the sample rate.
With the new library, this does not seem to happen.
We've also tried the following configurations:
Unfortunately, none of them had an effect.
The one thing that did work was setting `DD_TRACE_ENABLED=false`, which, as expected, disabled the tracing (showing that at least that environment variable is being used).

Other (maybe) relevant configuration: we have Istio tracing set to 100%, since the Datadog configuration is expected to handle dropping traces.