DataDog / dd-trace-cpp

C++ APM SDK
https://www.datadoghq.com
Apache License 2.0

Environment-based sampling rules are not being honored vs. the OpenTracing implementation #92

Open Smeb opened 9 months ago

Smeb commented 9 months ago

Hi - we upgraded to Istio 1.19.5, and thereby to dd-trace-cpp 0.1.8. We noticed the following changes:

This is for an Istio gateway service running Istio 1.19.5, Envoy 1.27, dd-trace-cpp 0.1.8.

We're trying to mitigate the second issue, but so far the environment variables don't seem to have an effect.

Originally, we had the following environment variable set:

- name: DD_TRACE_SAMPLING_RULES
  value: '[]'

This comes from here. With the OpenTracing-based implementation, it meant the agent controlled the sample rate.

With the new library, this does not seem to happen.

We've also tried the following configurations (without success):

DD_TRACE_SAMPLING_RULES='[{"service": "external-ingressgateway.istio-system","sample_rate": 0.1}]'
DD_TRACE_SAMPLING_RULES='[{"sample_rate": 0.1}]'
DD_TRACE_SAMPLE_RATE=0.1

Unfortunately, none of them had an effect.

The one thing that did work was setting DD_TRACE_ENABLED=false, which, as expected, disabled tracing (showing that at least that environment variable is being used).

Other (maybe) relevant configuration: we have Istio tracing set to 100%, since the Datadog configuration is expected to handle dropping traces.

Smeb commented 9 months ago

As a first step, I opened backports of the latest dd-trace-cpp release onto Envoy's 1.27/1.28 release branches.

dgoffredo commented 9 months ago

Hi, @Smeb.

The bug where Envoy's Datadog tracer does not set the "error" property on spans that are considered errors by Envoy was fixed in a recent set of PRs to Envoy (main, v1.27, v1.28). The fix has not yet found its way into Istio.

None of the Envoy/Istio Datadog bugs that customers have been encountering since v1.27 are due to dd-trace-cpp. Instead, they are bugs that I introduced to Envoy itself while rewriting Envoy's Datadog extension to use dd-trace-cpp (as opposed to the older OpenTracing-based dd-opentracing-cpp).

I don't think that it's appropriate to backport the latest release of dd-trace-cpp onto Envoy's release branches. Those release branches can continue to use whichever versions of dd-trace-cpp they use. The fixes, which are in Envoy's source tree, have all been backported already.

What's new to me is how your trace sampling went to 100%. The only sampling-related bug that I'm aware of in Envoy v1.27 could cause fewer traces to be sampled, but never more.

This sampling-related change that you've encountered is best debugged with your system's complete Istio configuration in hand (whatever that might mean).

You can open a Support ticket, which will likely get escalated to me, and/or we can investigate the sampling issue here on GitHub.

My initial goal will be to get an Istio configuration that allows me to reproduce the unexpected 100% sampling behavior. For this, the following from you would be helpful: complete YAML files, Helm charts, etc.

If any of this information is sensitive, then we can instead carry on in Datadog's Support system (linked above).

Smeb commented 9 months ago

I can gather that information tomorrow - in the meantime, here are some relevant details.

With proxy debug logging enabled for tracing, this is the logged configuration:

{
  "collector": {
    "config": {
      "event_scheduler": {
        "type": "Envoy::Extensions::Tracers::Datadog::EventScheduler"
      },
      "flush_interval_milliseconds": 2000,
      "http_client": {
        "type": "Envoy::Extensions::Tracers::Datadog::AgentHTTPClient"
      },
      "url": "http://xxx.xxx.xxx.xxx:8126/v0.4/traces"
    },
    "type": "datadog::tracing::DatadogAgent"
  },
  "defaults": {
    "environment": "sandbox",
    "name": "envoy.proxy",
    "service": "external-ingressgateway.external-ingressgateway",
    "service_type": "web",
    "version": "envoy d475eda09052cbf4ee25157f68b13abdb7b17c2f/1.27.3-dev/Clean/RELEASE/BoringSSL"
  },
  "environment_variables": {
    "DD_AGENT_HOST": "xxx.xxx.xxx.xxx",
    "DD_ENV": "sandbox",
    "DD_SERVICE": "external-ingressgateway.external-ingressgateway",
    "DD_TRACE_AGENT_PORT": "8126",
    "DD_TRACE_ENABLED": "true",
    "DD_TRACE_SAMPLING_RULES": "[{\"sample_rate\":0.1}]"
  },
  "extraction_styles": ["tracecontext", "Datadog"],
  "injection_styles": ["tracecontext", "Datadog"],
  "span_sampler": {
    "rules": []
  },
  "tags_header_size": 512,
  "trace_sampler": {
    "max_per_second": 200.0,
    "rules": [
      {"name": "*", "resource": "*", "sample_rate": 0.1, "service": "*"}
    ]
  },
  "version": "[dd-trace-cpp version v0.1.8]"
}

Some notes on the above: sandbox is a test cluster, and I'm using Envoy proxy nightlies to pick up the fixed proxy code from your changes two weeks ago.

I also tried setting span sampling - that didn't help unfortunately.

~~I also set the rate limit to one trace per second, which also did not stop the traces. I ran a load test at 100 qps and got 100 traces back per second. This points to the trace sampling not even being part of the decision-making.~~

With the nightly the error issue appears to be fixed, which is great.

Tomorrow I will try to get this running/reproduced in kind, and then I can get you some charts.

Smeb commented 9 months ago

Update - the behaviour when setting traces per second is a bit odd. In the ingestion dashboard, Datadog reports 180 traces ingested per second (it should be 3). In APM, there are around 5,000 spans over 30 minutes, which is roughly what you would expect.

In this case I suppose I should trust the ingestion dashboard (since I believe that's what we get billed for).

Update to the update - I expect the indexed-spans count was just lagging. Now it shows 750,000 spans.

dgoffredo commented 9 months ago

Thanks for the info and explanation, @Smeb.

I'll be unavailable until Monday next week. Here are some thoughts until then.

There are multiple layers of sampling configuration involved.

  1. At the bottom, there is the sampling configuration of dd-trace-cpp. Based on the logged configuration banner that you shared, dd-trace-cpp is configured to keep 10% of traces that begin with that tracer. If a trace begins elsewhere, then the sampling decision is made elsewhere. If traces begin with the Envoy that Istio has configured, then all traces will be sent to the Datadog Agent, but only 10% of them will be marked as "send these to Datadog." The sampling configuration of dd-trace-cpp can be overridden by environment variables (DD_TRACE_SAMPLE_RATE, DD_TRACE_SAMPLING_RULES).
  2. Then there is Envoy. Envoy has its own sampling percentage. Assuming that you have the fix for the bug that I mentioned in a previous comment, here's how it works (a sketch of this interplay follows the list):
    • If a trace is coming from the client, then Envoy uses the client's sampling decision. Envoy's configuration then does not matter (and neither does dd-trace-cpp's).
    • If Envoy is at the beginning of the trace, then Envoy considers its configured sampling rate. If the resulting decision is "drop," then it tells dd-trace-cpp to drop the trace. If the resulting decision is "keep," then it tells dd-trace-cpp to make its own sampling decision, which dd-trace-cpp makes based on its sampling configuration.
  3. At the top is Istio. Istio's configuration (maybe an IstioOperator, maybe a Helm chart's values.yaml, maybe an argument passed to istioctl...) has associated with it a sampling rate for the "pilot," and optionally for the egress and/or ingress "gateways." Istio uses this sampling rate to configure Envoy's sampling rate. Datadog's documentation recommends that Istio's sampling rate be set to 100%, and that dd-trace-cpp's sampling rate be set using an environment variable.
    • But where do you put the environment variable? The documentation uses the example of a Deployment that has Istio's sidecar injection enabled. In that case, there is a special annotation (apm.datadoghq.com/env) that can be set on the Deployment in order to specify Datadog-specific environment variables for the injected sidecar (Envoy). Based on the configuration JSON you shared from the log, this part seems to be working.
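
To make the interplay in item 2 concrete, here is a toy sketch in C++ (dd-trace-cpp's language) of the decision for a trace that begins at Envoy. It is not the actual Envoy or dd-trace-cpp code; the function names and the plain random draws standing in for the real samplers are illustrative assumptions.

    // Hypothetical sketch of the layered decision for a trace that begins at
    // Envoy (i.e. no sampling decision arrived with the request). Names and
    // the plain random draws are illustrative, not the real implementation.
    #include <random>

    enum class Decision { Drop, Keep };

    bool coin(double probability) {
      static std::mt19937 generator{std::random_device{}()};
      std::uniform_real_distribution<double> dist(0.0, 1.0);
      return dist(generator) < probability;
    }

    // envoy_rate: Envoy's sampling rate, configured by Istio (1.0 means 100%).
    // dd_rate: dd-trace-cpp's configured rate (e.g. 0.1 from DD_TRACE_SAMPLE_RATE).
    Decision decide_root_trace(double envoy_rate, double dd_rate) {
      if (!coin(envoy_rate)) {
        // Envoy decides "drop" and tells dd-trace-cpp to drop the trace.
        return Decision::Drop;
      }
      // Envoy decides "keep", so dd-trace-cpp makes its own decision from its
      // sampling configuration. All spans still reach the Datadog Agent; the
      // decision only controls which traces are marked for sending to Datadog.
      return coin(dd_rate) ? Decision::Keep : Decision::Drop;
    }

With Istio at 100% and dd-trace-cpp at 0.1, this sketch keeps roughly 10% of root traces, which is the behavior you were expecting.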

The question is, why now are you seeing 100% of traces in the Datadog UI, when previously you saw only 10%?

It might be helpful to pursue this in Datadog's Support system, as our Support representatives can see the things in the Datadog UI that you are seeing, and this would help to diagnose the issue.

For now, it would be helpful to have the complete configuration of Istio in your system. For example, are you using an IstioOperator? These traces for which you are configuring sampling: Do they involve service sidecars, ingress gateways, or egress gateways (or some combination)? How are those Istio components configured?

Again, the goal is for me to have a reproduction of the issue on my test Kubernetes cluster.

Smeb commented 8 months ago

Small follow-up - we're just working on getting a minimal configuration we can share. In parallel, we're reaching out to Support, so hopefully we'll have something we can send to you soon. We can't share the whole current configuration here as-is, and I think it makes sense to try to find a minimal reproduction, since that should help pinpoint the issue.

Smeb commented 8 months ago

Bringing some news (but no configs yet). I believe this should help narrow down the problem, though.

I'm continuing to trim our config, but this seems to narrow it down. I'll try with a no-op Lua filter and see if that shows the same behaviour.

dgoffredo commented 8 months ago

Thanks for the update, @Smeb. I'm unfamiliar with Envoy's Lua extension and how it might interact with tracing and with Istio.

That will have to be part of my attempted reproduction once you have a configuration for me. I appreciate your putting forth the effort to produce a minimal example; that's very helpful.

dgoffredo commented 8 months ago

@Smeb I think that you have indeed found the root cause. It's another difference in behavior in Envoy's Datadog tracing extension between Envoy <= 1.26 and Envoy >= 1.27.

In the old OpenTracing-based code, "overriding" the sampling decision looked like this:

bool sampling_priority_locked can be true in two scenarios:

  1. when a sampling decision was extracted from an incoming request (i.e. Envoy is not the root service), or
  2. when trace context was injected into another service (i.e. Envoy has at least one child).

In the two scenarios above, Lua setting the sampling decision would do nothing.


In the new dd-trace-cpp based code, overriding the sampling decision looks like this:

The new code performs an unconditional override of the sampling decision, even if one has been made previously.

In the two scenarios above, Lua setting the sampling decision could override the configured sampling behavior.
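
For illustration only, here is a toy C++ contrast of the two behaviors described above. It is a sketch of the described logic, not the actual Envoy Datadog extension, and the type names are made up.

    // Toy contrast of the old vs. new behavior described above. Type names
    // are hypothetical; this is not the Envoy Datadog extension's real code.
    #include <optional>

    struct OldStyleSpan {
      std::optional<bool> sampled;
      // True once a decision was extracted from an incoming request, or once
      // trace context was injected into a downstream request.
      bool sampling_priority_locked = false;

      void setSampled(bool value) {
        if (sampling_priority_locked) {
          return;  // the earlier decision wins; a Lua override is a no-op
        }
        sampled = value;
      }
    };

    struct NewStyleSpan {
      std::optional<bool> sampled;

      void setSampled(bool value) {
        sampled = value;  // unconditional: overrides any earlier decision
      }
    };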


I'm not sure what is best to do about this. More information about your setup would still be helpful.

One thing that I could do is alter setSampled so that it ignores its argument whenever a sampling decision has already been made. I could then backport that change onto Envoy v1.27, v1.28, and v1.29, and you'd have to either wait for Istio to pick up the change or use a non-release version of Istio's Envoy-based proxy.

Or, perhaps the new behavior is preferable, and there's a way that you can prevent the Lua extension from modifying the sampling decision.

What do you think?

Smeb commented 8 months ago

If this is the root cause, awesome!

Re: our setup, I'm currently experimenting with changing the Envoy Lua example to also use Datadog tracing - this seemed easier than trying to trim down our config, which uses the Istio Operator, sidecars, et al. Really, we just want the Envoy bit.

In our configuration, though, I did find the following.

This is the Istio patch logic:

    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.lua
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua"
          inlineCode: |
            function on_healthz_call(cluster, request_handle)
                local req_headers = {
                    [":authority"] = "sidecar",
                    [":method"] = "GET",
                    [":path"] = "/healthz/ready",
                }

                local resp_headers
                local _, err = pcall(function ()
                    resp_headers, _ = request_handle:httpCall(cluster, req_headers, "", 1000)
                end)

                if err then
                    request_handle:respond({ [":status"] = "503" }, "")
                    return -- without this, the respond() below would index a nil resp_headers
                end

                request_handle:respond({ [":status"] = resp_headers[":status"] }, "")
            end

I think the value part of this should be possible to drop directly into a configuration.

Regarding the behaviour, we would prefer the old behaviour, since I don't see an obvious way to configure the tracing behaviour itself in these docs.

dgoffredo commented 8 months ago

I haven't tested this yet, but perhaps this is what's happening:

So, it looks like the documented contract for Span::setSampled disagrees with its current usage. There is a mismatch between the default value of RequestOptions::sampled_, which is true (as opposed to nullopt), and Span::setSampled's contract, which states:

This method overrides any previous sampling decision associated with the trace instance. If the sampled parameter is false, this span and any subsequent child spans are not reported to the tracing system.

The question is still whether I should change Envoy's Datadog tracing extension to the old OpenTracing behavior, or if I should take this up with the maintainers of Envoy. Perhaps either RequestOptions::sampled_ should default to nullopt, or the Lua filter extension should set the value to nullopt if no trace_sampled table option is passed to httpCall in Lua.
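
As a rough illustration of that mismatch (again a hypothetical sketch with made-up types, not Envoy's real ones): because the options default the flag to true rather than leaving it unset, a caller that expressed no opinion still ends up forcing setSampled(true).

    // Hypothetical sketch of the default-value mismatch, not Envoy's real code.
    #include <optional>

    struct Span {
      std::optional<bool> sampled;
      void setSampled(bool value) { sampled = value; }  // unconditional override
    };

    struct RequestOptions {
      // Defaults to true. If this defaulted to std::nullopt instead, "no
      // opinion" (e.g. a Lua httpCall with no trace_sampled option) could be
      // distinguished from an explicit "keep this trace."
      std::optional<bool> sampled = true;
    };

    void start_child_request(const RequestOptions& options, Span& parent) {
      if (options.sampled.has_value()) {
        // With the current default this branch always runs, so a sampling
        // decision made earlier (e.g. "keep only 10%") gets overridden.
        parent.setSampled(*options.sampled);
      }
    }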

Maybe it's best for me to submit a PR to update the contract of Span::setSampled, and see what the Envoy maintainers think.

In the meantime, there's the question of what to do about the sampling in your system. Even if you explicitly specify a sampling decision in Lua, there does not appear to be a way to say "use whatever decision has already been made."

If there were a way to retrieve the current sampling decision in Lua, then you could use that as a workaround until the Datadog/Envoy/sampling question is decided, but I don't see a way to do that.

Perhaps it's best to restore the old behavior in Envoy's Datadog tracing extension, and leave the larger question for another time.

Any ideas?

dgoffredo commented 8 months ago

Any ideas?

I just talked through this with my teammate, and we have an idea.

Instead of changing the Datadog extension's Span::setSampled back to the old behavior, I can propose the following changes:

I still would need to address the underlying contractual issue with the Envoy maintainers, but this is a workaround that you and others could use, and it minimizes any possible unintended side effects.

Smeb commented 8 months ago

Sorry for the delay in replying - timezones.

My suggestion would be to restore the old behaviour, and then discuss with the Envoy maintainers the possibility of updating the Lua filter's tracing behaviour in the future. I think the workaround comes with a few issues:

That said, I do agree that the current usage is problematic and worth addressing in the long term. Also, thanks for the investigative work - all the code links really help clarify the source of the issue.

dgoffredo commented 8 months ago

Yes, I think I'll restore the old behavior in Envoy's Datadog extension, and open a separate issue in Envoy about the intended behavior of Span::setSampled.

Smeb commented 8 months ago

Hi - just to follow up, is there a rough ETA on reverting the behaviour? Happy to help to expedite this if there's any way we can support.

dgoffredo commented 8 months ago

I'll throw some code at it today. Will let you know if I expect a delay after that.

dmehala commented 6 months ago

Hi @Smeb,

Sorry for the delay. I'll take care of @dgoffredo's work. I'm on PTO now, but I'll make it my top priority when I'm back the week of April 8th.

Thank you for your patience.

Smeb commented 6 months ago

Thanks for the follow up 👍