Smeb opened 9 months ago
Hi, @Smeb.
The bug where Envoy's Datadog tracer does not set the "error" property on spans that are considered errors by Envoy was fixed in a recent set of PRs to Envoy (main, v1.27, v1.28). The fix has not yet found its way into Istio.
None of the Envoy/Istio Datadog bugs that customers have been encountering since v1.27 are due to dd-trace-cpp. Instead, they are bugs that I introduced to Envoy itself while rewriting Envoy's Datadog extension to use dd-trace-cpp (as opposed to the older OpenTracing-based dd-opentracing-cpp).
I don't think that it's appropriate to backport the latest release of dd-trace-cpp onto Envoy's release branches. Those release branches can continue to use whichever versions of dd-trace-cpp they use. The bugs, which are in Envoy's source tree, have all been backported already.
What's new to me is how your trace sampling went to 100%. The only sampling-related bug that I'm aware of in Envoy v1.27 could cause fewer traces to be sampled, but never more.
This sampling-related change that you've encountered is best debugged with your system's complete Istio configuration in hand (whatever that might mean).
You can open a Support ticket, which will likely get escalated to me, and/or we can investigate the sampling issue here on GitHub.
My initial goal will be to get an Istio configuration that allows me to reproduce the unexpected 100% sampling behavior. For this the following from you would be helpful: complete YAML files, helm charts, etc.
If any of this information is sensitive, then we can instead carry on in Datadog's Support system (linked above).
I can gather that information tomorrow; in the meantime, here are some relevant details, if they help.

The sampling decision is tagged `automatic`, not `configured`. From my understanding, if the rules were working, it would be tagged with `rule`.

With proxy debug logs enabled for tracing, this is the configuration:
```json
{
  "collector": {
    "config": {
      "event_scheduler": {
        "type": "Envoy::Extensions::Tracers::Datadog::EventScheduler"
      },
      "flush_interval_milliseconds": 2000,
      "http_client": {
        "type": "Envoy::Extensions::Tracers::Datadog::AgentHTTPClient"
      },
      "url": "http://xxx.xxx.xxx.xxx:8126/v0.4/traces"
    },
    "type": "datadog::tracing::DatadogAgent"
  },
  "defaults": {
    "environment": "sandbox",
    "name": "envoy.proxy",
    "service": "external-ingressgateway.external-ingressgateway",
    "service_type": "web",
    "version": "envoy d475eda09052cbf4ee25157f68b13abdb7b17c2f/1.27.3-dev/Clean/RELEASE/BoringSSL"
  },
  "environment_variables": {
    "DD_AGENT_HOST": "xxx.xxx.xxx.xxx",
    "DD_ENV": "sandbox",
    "DD_SERVICE": "external-ingressgateway.external-ingressgateway",
    "DD_TRACE_AGENT_PORT": "8126",
    "DD_TRACE_ENABLED": "true",
    "DD_TRACE_SAMPLING_RULES": "[{\"sample_rate\":0.1}]"
  },
  "extraction_styles": ["tracecontext", "Datadog"],
  "injection_styles": ["tracecontext", "Datadog"],
  "span_sampler": {
    "rules": []
  },
  "tags_header_size": 512,
  "trace_sampler": {
    "max_per_second": 200.0,
    "rules": [
      {"name": "*", "resource": "*", "sample_rate": 0.1, "service": "*"}
    ]
  },
  "version": "dd-trace-cpp version v0.1.8"
}
```
Some notes on the above: `sandbox` is a test cluster, and I'm using Envoy proxy nightlies to get the fixed version of the proxy code from your changes two weeks ago.

I also tried setting span sampling; unfortunately, that didn't help.
~~I also set the rate limit to 1 trace per second, which also did not stop the traces. I ran a load test at 100 qps and got 100 traces back per second. This, to me, points to the trace sampling not even being part of the decision making.~~
With the nightly the error issue appears to be fixed, which is great.
Tomorrow I will try to get this running/reproduced in kind, and then I can get you some charts.
Update: the behaviour when setting traces per second is a bit odd. The ingestion dashboard in Datadog reports 180 traces ingested per second (it should be 3). In APM, there are around 5,000 spans over 30 minutes, which is roughly what you would expect. In this case I suppose I should trust the ingestion dashboard (since I believe that's what we get billed for).

Update to the update: I expect the indexed-span count is just lagged. Now it shows 750,000 spans.
Thanks for the info and explanation, @Smeb.
I'll be unavailable until Monday next week. Here are some thoughts until then.
There are multiple layers of sampling configuration involved:

1. dd-trace-cpp has its own sampling configuration, set via environment variables (e.g. `DD_TRACE_SAMPLE_RATE`, `DD_TRACE_SAMPLING_RULES`).
2. Istio's configuration (an `IstioOperator`, maybe a helm chart's `values.yaml`, maybe an argument passed to `istioctl`...) has associated with it a sampling rate for the "pilot," and optionally for the egress and/or ingress "gateways." Istio uses this sampling rate to configure Envoy's sampling rate. Datadog's documentation recommends that Istio's sampling rate be set to 100%, and that dd-trace-cpp's sampling rate be set using an environment variable.
3. Your service might run as a `Deployment` that has Istio's sidecar injection enabled. In that case, there is a special annotation (`apm.datadoghq.com/env`) that can be set on the `Deployment` in order to specify Datadog-specific environment variables for the injected sidecar (Envoy). Based on the configuration JSON you shared from the log, this part seems to be working.

The question is: why are you now seeing 100% of traces in the Datadog UI, when previously you saw only 10%?
It might be helpful to pursue this in Datadog's Support system, as our Support representatives can see the things in the Datadog UI that you are seeing, and this would help to diagnose the issue.
For now, it would be helpful to have the complete configuration of Istio in your system. For example, are you using an `IstioOperator`? These traces for which you are configuring sampling: do they involve service sidecars, ingress gateways, or egress gateways (or some combination)? How are those Istio components configured?
Again, the goal is for me to have a reproduction of the issue on my test Kubernetes cluster.
Small follow-up: we're just working on getting a minimal configuration we can share. In parallel, we're reaching out to Support, so hopefully we'll have something we can send you soon. We can't share the whole current configuration here as-is, and I think it makes sense to find a minimal reproduction anyway, since that should help pinpoint the issue.
Bringing some news (but no configs). I believe this should help narrow down the problem, though.
I'm continuing to trim our config, but this seems to narrow it down. I'll try with a no-op lua filter and see if that shows the same behaviour.
Thanks for the update, @Smeb. I'm unfamiliar with Envoy's Lua extension and how it might interact with tracing and with Istio.
That will have to be part of my attempted reproduction once you have a configuration for me. I appreciate your putting forth the effort to produce a minimal example; that's very helpful.
@Smeb I think that you have indeed found the root cause. It's another difference in behavior in Envoy's Datadog tracing extension between Envoy <= 1.26 and Envoy >= 1.27.
In the old OpenTracing-based code, "overriding" the sampling decision looked like this:
`bool sampling_priority_locked` can be `true` in two scenarios:
In the two scenarios above, Lua setting the sampling decision would do nothing.
In the new dd-trace-cpp based code, overriding the sampling decision looks like this:
The new code performs an unconditional override of the sampling decision, even if one has been made previously.
In the two scenarios above, Lua setting the sampling decision could override the configured sampling behavior.
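The difference can be sketched in standalone C++. These structs and names (`OldStyleSpan`, `NewStyleSpan`) are invented for illustration; Envoy's real span classes are more involved, but the guard logic is the point:

```cpp
#include <optional>

// OldStyleSpan models the OpenTracing-era guard: once the sampling priority
// is locked, setSampled is a no-op.
struct OldStyleSpan {
  std::optional<bool> sampled;            // decision made by the sampler, if any
  bool sampling_priority_locked = false;  // true once the decision is final

  void setSampled(bool value) {
    if (sampling_priority_locked) {
      return;  // old behavior: a locked decision cannot be overridden
    }
    sampled = value;
  }
};

// NewStyleSpan models the dd-trace-cpp-era behavior: unconditional override.
struct NewStyleSpan {
  std::optional<bool> sampled;

  void setSampled(bool value) {
    sampled = value;  // new behavior: clobbers any prior decision
  }
};
```

With the old guard, a 10% rule's "drop" decision survives a later `setSampled(true)` from the Lua filter; with the new behavior, that same call forces the trace to be kept, which would explain sampling jumping to 100%.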
I'm not sure what is best to do about this. More information about your setup would still be helpful.
One thing that I could do is alter `setSampled` so that it ignores its argument whenever a sampling decision has already been made. I could then backport that change onto Envoy v1.27, v1.28, and v1.29, and you'd have to either wait for Istio to pick up the change or use a non-release version of Istio's Envoy-based proxy.
Or, perhaps the new behavior is preferable, and there's a way that you can prevent the Lua extension from modifying the sampling decision.
What do you think?
If this is the root cause, awesome!
re: our setup, I'm currently experimenting with changing the Envoy Lua example to also use Datadog tracing; this seemed easier than trying to trim down our config, which uses the Istio operator, sidecars, et al. Really we just want the Envoy bit.
In our configuration though, I did find the following:
This is the Istio patch logic:

```yaml
patch:
  operation: INSERT_BEFORE
  value:
    name: envoy.lua
    typed_config:
      "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua"
      inlineCode: |
        function on_healthz_call(cluster, request_handle)
          local req_headers = {
            [":authority"] = "sidecar",
            [":method"] = "GET",
            [":path"] = "/healthz/ready",
          }
          local resp_headers
          local _, err = pcall(
            function ()
              resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", 1000)
            end)
          if err then
            request_handle:respond({ [":status"] = "503" }, "")
            return -- avoid responding a second time below
          end
          request_handle:respond({ [":status"] = resp_headers[":status"] }, "")
        end
```
The `value` part of this should be possible to drop directly into a configuration.
Regarding the behaviour, we would prefer the old behaviour, since I don't see an obvious way to configure the tracing behaviour itself in these docs.
I haven't tested this yet, but perhaps this is what's happening:

Your `httpCall` in Lua does not specify a table argument (instead, you pass the timeout integer). So, the configuration passed to the underlying HTTP call in the C++ code comes from a default-constructed `RequestOptions`, whose relevant member is declared `absl::optional<bool> sampled_{true};`. It defaults to `true`, not to `nullopt`.

`sampled_` is then used to set the span's sampling decision, which triggers the rest of the call stacks that I described in my previous comment.

So, it looks like the documented contract for `Span::setSampled` disagrees with current usage. There is a mismatch between `RequestOptions::sampled_`'s default value of `true` (as opposed to `nullopt`) and `Span::setSampled`'s contract, which states:

> This method overrides any previous sampling decision associated with the trace instance. If the sampled parameter is false, this span and any subsequent child spans are not reported to the tracing system.

The question is still whether I should change Envoy's Datadog tracing extension back to the old OpenTracing behavior, or whether I should take this up with the maintainers of Envoy. Perhaps either `RequestOptions::sampled_` should default to `nullopt`, or the Lua filter extension should set the value to `nullopt` if no `trace_sampled` table option is passed to `httpCall` in Lua.
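To make the mismatch concrete, here is a standalone sketch. The names `CurrentRequestOptions`, `ProposedRequestOptions`, and `applySampling` are invented; they mirror the idea of the default value, not Envoy's actual code:

```cpp
#include <optional>

// "Current": the member is initialized to true, so a call that omits the
// options table still carries an explicit "force sampled" instruction.
struct CurrentRequestOptions {
  std::optional<bool> sampled{true};
};

// "Proposed": an empty optional means "no opinion; keep the prior decision."
struct ProposedRequestOptions {
  std::optional<bool> sampled;  // defaults to std::nullopt
};

// Applies the options to a prior sampling decision, honoring "no opinion."
template <typename Options>
bool applySampling(const Options& opts, bool prior_decision) {
  return opts.sampled.value_or(prior_decision);
}
```

With the current default, a prior "drop" decision is always overridden to "keep"; with a `nullopt` default, the prior decision would pass through untouched.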
Maybe it's best for me to submit a PR to update the contract of Span::setSampled, and see what the Envoy maintainers think.
In the meantime, there's the question of what to do about the sampling of your system. Even if you explicitly specify a sampling decision in Lua, there does not appear to be a way to say "use whatever decision has already been made."
If there were a way to retrieve the current sampling decision in Lua, then you could use that as a workaround until the Datadog/Envoy/sampling question is decided, but I don't see a way to do that.
Perhaps it's best to restore the old behavior in Envoy's Datadog tracing extension, and leave the larger question for another time.
Any ideas?
I just talked through this with my teammate, and we have an idea. Instead of changing the Datadog extension's `Span::setSampled` back to the old behavior, I can propose the following changes:

1. The `httpCall` Lua function currently coerces its sampling argument into a boolean. Instead, allow it to accept `nil` in addition. This is technically a breaking change, since `nil` is falsy in Lua; the change would make it function as a "no-op" in this context instead.
2. Pass the resulting `optional<bool>` to `RequestOptions::setSampled`, instead of passing a `bool`. The null value, which is still not the default, would mean "don't override the sampling decision."
3. In your Lua code, use the table form of `httpCall`, and explicitly mention `trace_sampled=nil`, e.g.

```lua
resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", {timeout_ms=1000, trace_sampled=nil})
```

I still would need to address the underlying contractual issue with the Envoy maintainers, but this is a workaround that you and others could use, and it minimizes any possible unintended side effects.
Sorry for the delay in replying - timezones.

My suggestion would be to restore the old behaviour, and then discuss with Envoy the possibility of updating the Lua filter's tracing behaviour for the future. I think the workaround comes with a few issues:

- It requires configuration by everyone who needs it, whereas most consumers would prefer safe defaults/least surprise. Ideally the "default" (non-overloaded) version should work as expected, and migrating between versions should require no changes to user configurations.
- It potentially impacts every trace library interacting with the Lua filter, which means it'll be more difficult to upstream. In the interest of solving the user problem quickly, I think it makes sense to localise changes to the Datadog part of the extension.
- There could be other places in the code where `setSampled` is called which aren't in the Lua filter; those would still be broken if they have similar behaviour.

That said, I do agree that the current usage is problematic and worth addressing in the long term. Also, thanks for the investigative work - all the code links really help clarify the source of the issue.
Yes, I think I'll restore the old behavior in Envoy's Datadog extension, and open a separate issue in Envoy about the intended behavior of `Span::setSampled`.
Hi - just to follow up, is there a rough ETA on reverting the behaviour? Happy to help to expedite this if there's any way we can support.
I'll throw some code at it today. Will let you know if I expect a delay after that.
Hi @Smeb,
Sorry for the delay. I'll take care of @dgoffredo's work. I'm on PTO now, but I'll make it my top priority when I'm back the week of April 8th.
Thank you for your patience.
Thanks for the follow up 👍
Hi - we upgraded to Istio 1.19.5, and thereby to dd-trace-cpp 0.1.8. We noticed the following changes:
This is for an Istio gateway service running Istio 1.19.5, Envoy 1.27, dd-trace-cpp 0.1.8.
We're trying to mitigate the second issue, but so far the environment variables don't seem to have an effect.
Originally, we had the following environment variable set:
This comes from here. With OpenTracing, this meant the agent was controlling the sample rate.
With the new library, this does not seem to happen.
We've also tried the following configurations:
Unfortunately, none of them had an effect.
The one thing that did work was setting `DD_TRACE_ENABLED=false`, which, as expected, disabled the tracing (showing that at least that environment variable is being used).

Other (maybe) relevant configuration: we have Istio tracing set to 100%, since the Datadog configuration is expected to handle dropping traces.