DataDog / dd-sdk-ios

Datadog SDK for iOS - Swift and Objective-C.
Apache License 2.0

Trace sample rate does not work for child traces #2064

Closed shagedorn closed 3 weeks ago

shagedorn commented 2 months ago

Describe the bug

In our app, we have automatic network tracking/tracing enabled, but also send a few custom traces. We apply the trace sample rate in several places in the Datadog setup (shortened for brevity):

let traceSampleRate = 5.0

RUM.enable(with: .init(
  …
  urlSessionTracking: .init(
    firstPartyHostsTracing: .trace(
      hosts: […],
      // apply sample rate to network tracking
      sampleRate: traceSampleRate
    )
  )
))

Trace.enable(with: .init(
  // apply sample rate to custom traces
  sampleRate: traceSampleRate,
  urlSessionTracking: .init(
    firstPartyHostsTracing: .trace(
      hosts: […],
      // apply sample rate to network traces
      sampleRate: traceSampleRate
    )
  ),
  …
))

We noticed in production that some of our custom traces show up much more often than others and suspected an issue with the sampling rate.

Reproduction steps

We were able to track it down by setting the sample rate to 0 (in our debug environment), and could see that a few traces always come through regardless. It turns out all custom traces with a parent trace are affected, regardless of whether we set the parent explicitly or Datadog detects it automatically. So, in pseudocode:

trace1.setActive()
  trace2.setActive()
  trace2.finish()
trace1.finish()

…would lead to trace2 escaping the downsampling.

trace1.setActive()
trace1.finish()

trace2.setActive()
trace2.finish()

…whereas this works as expected – given a sample rate of 0, neither of them would be sent.

SDK logs

-

Expected behavior

The trace sample rate should govern all traces to effectively limit cost.

We will need to disable custom traces entirely until this is fixed because it produces costs we cannot control, so we hope for a quick solution.

Affected SDK versions

2.14.2 - 2.17.0

Latest working SDK version

unknown

Did you confirm if the latest SDK version fixes the bug?

Yes

Integration Methods

SPM

Xcode Version

Xcode 15.4

Swift Version

Swift 5.9

MacOS Version

macOS Sonoma 14.7 (23H124)

Deployment Target

iOS 16

Device Information

Reproduces in simulators and devices

Other relevant information

No response

ncreated commented 1 month ago

Hey @shagedorn 👋.

Thanks a lot for providing the reproduction steps; that’s always incredibly helpful. However, in this case, I couldn't reproduce the issue. I tried the following minimal setup with a trace sampleRate of 0:

import DatadogCore
import DatadogTrace

Datadog.initialize(
    with: .init(clientToken: "<client-token>", env: "<env>"),
    trackingConsent: .granted
)
Datadog.verbosityLevel = .debug

Trace.enable(
    with: .init(sampleRate: 0)
)

for _ in (0..<10) {
    let span1 = Tracer.shared().startSpan(operationName: "parent").setActive()
    let span2 = Tracer.shared().startSpan(operationName: "child").setActive()
    span2.finish()
    span1.finish()
}

No spans were indexed.

Here are a few important points to keep in mind about APM sampling, particularly when using a sampleRate of 0 or higher:

sampleRate: 0

Even with a sampleRate of 0, all spans are sent from the SDK and ingested by Datadog APM. These spans will appear in the LIVE view (APM Traces), but because none of them will be indexed, they won’t contribute to your billing.

sampleRate: 0-100

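Since version 2.11.0, DatadogTrace supports head-based sampling (see also: https://github.com/DataDog/dd-sdk-ios/issues/1713). With this feature, the sampling decision is made once per trace, at the root span, and applies to every span in that trace.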

For this reason, it's important not to assume that a 50% sample rate will index 50% of spans. Instead, it will index 50% of traces.
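To illustrate the difference between sampling traces and sampling spans, here is a minimal sketch in plain Swift. It is a toy model, not the DatadogTrace implementation: the `SampledSpan` type, the `makeTrace` helper, and the fixed `rolls` values are all hypothetical and exist only to make the example deterministic.

```swift
// Illustrative model of head-based sampling; not the DatadogTrace implementation.
struct SampledSpan {
    let traceID: Int
    let name: String
    let isKept: Bool
}

/// Builds one trace. The keep/drop decision is made once, for the root span,
/// and every child span inherits it.
/// - Parameter roll: a value in 0..<100, passed in so the example is deterministic
///   (a real sampler would draw it at random).
func makeTrace(id: Int, spanCount: Int, sampleRate: Double, roll: Double) -> [SampledSpan] {
    let isKept = roll < sampleRate
    return (0..<spanCount).map { index in
        SampledSpan(traceID: id, name: index == 0 ? "root" : "child-\(index)", isKept: isKept)
    }
}

// Four traces of different sizes, sampled at 50%.
let spanCounts = [1, 5, 1, 1]
let rolls: [Double] = [10, 60, 30, 90]
let traces = zip(spanCounts, rolls).enumerated().map { index, pair in
    makeTrace(id: index, spanCount: pair.0, sampleRate: 50, roll: pair.1)
}

let keptTraces = traces.filter { $0[0].isKept }.count
let keptSpans = traces.joined().filter { $0.isKept }.count
let totalSpans = traces.joined().count

print("kept \(keptTraces) of \(traces.count) traces") // kept 2 of 4 traces
print("kept \(keptSpans) of \(totalSpans) spans")     // kept 2 of 8 spans
```

In this made-up data set, half of the traces are kept but only a quarter of the spans, because the largest trace happened to be sampled out.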

sampleRate in Distributed Tracing

The sampleRate set in the urlSessionTracking API applies to distributed traces — the traces our SDK automatically creates for intercepted network requests. This sampling rate is also respected by compatible downstream backend tracers.

⚠ In your code snippet, I noticed that urlSessionTracking is configured for both DatadogRUM and DatadogTrace. This isn't necessary. In fact, DatadogTrace instrumentation skips creating a span if the request was already instrumented by RUM. I recommend only configuring urlSessionTracking for DatadogRUM in your setup, as the DatadogTrace option is mainly useful for non-RUM users.
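A sketch of what that simplified setup could look like, based on the snippet from the original report. It assumes `Datadog.initialize` has already been called; the application ID and the host name are placeholders, not values from the report, and the elided options are simply left out.

```swift
import DatadogCore
import DatadogRUM
import DatadogTrace

let traceSampleRate: Float = 5.0

// Network tracing is configured once, through RUM.
RUM.enable(with: .init(
    applicationID: "<rum-application-id>", // placeholder
    urlSessionTracking: .init(
        firstPartyHostsTracing: .trace(
            hosts: ["example.com"], // placeholder host
            sampleRate: traceSampleRate
        )
    )
))

// Custom traces keep their own sample rate; no urlSessionTracking here.
Trace.enable(with: .init(
    sampleRate: traceSampleRate
))
```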


@shagedorn, I hope this clears things up and aligns with your expectations. Let me know how it sounds or if you need further clarification.

shagedorn commented 1 month ago

Hi @ncreated, thanks a lot for all that context 🙏🏻 There's some new information in here; we're re-evaluating our initial report and will follow up with an update shortly!

shagedorn commented 1 month ago

> Even with a sampleRate of 0, all spans are sent from the SDK and ingested by Datadog APM. These spans will appear in the LIVE view (APM Traces), but because none of them will be indexed, they won’t contribute to your billing

Soo… we were blissfully unaware of head-based sampling. Thank you for clarifying this!

> For this reason, it's important not to assume that a 50% sample rate will index 50% of spans. Instead, it will index 50% of traces.

This is also helpful to know, thank you!

> I recommend only configuring urlSessionTracking for DatadogRUM in your setup, as the DatadogTrace option is mainly useful for non-RUM users.

Thank you, we'll clean this up 🙏🏻


✅ With this context, our original bug report isn't valid anymore.

❓ However, I tried your example (and our own code) and I feel like there's an opposite bug, or another misunderstanding:

When I set the sample rate to 0, it seems that only the child span is displayed in the LIVE view:

Image

…whereas when I set it to 100, I see both child & parent, in the LIVE view and also in the indexed view:

Image

Is this expected behaviour?

maxep commented 1 month ago

Hello @shagedorn 👋

That is unexpected. I can easily replicate this, and I can confirm that the SDK sends both span events when the trace is sampled out. I will keep you posted when we get to the bottom of this.

Thank you for the report!

maxep commented 3 weeks ago

Hello 👋

So I have more insights from the APM intake team: as @ncreated explained, the Live view presents events between ingestion and indexing. The sampling decision for keeping a trace is based on the root span (the parent in your example), so when a child span is ingested, its sampling is ignored because it has a parent. Therefore, it is ingested (not yet indexed) and you see it in the Live view. But when the root span arrives, the intake decides to drop it because it is sampled out, which explains why it is not seen in the LIVE view. Ultimately, the entire trace is dropped.

Here is a more detailed description of the ingestion flow: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/

So this behaviour is by design in our ingestion flow and the LIVE view.
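The behaviour described above can be sketched as a toy model in plain Swift. This is only an illustration of the explanation in this comment, not Datadog's intake code: the `IncomingSpan` type and the `liveViewNames` and `trace` helpers are hypothetical.

```swift
// Toy model of the behaviour described above; not Datadog's intake code.
struct IncomingSpan {
    let name: String
    let hasParent: Bool
    let isSampledIn: Bool // head-based decision carried by the span
}

/// Returns the names of spans that would show up between ingestion and indexing.
func liveViewNames(for spans: [IncomingSpan]) -> [String] {
    spans.filter { span in
        // A span with a parent is accepted without looking at its own sampling flag;
        // only the root span's decision is evaluated.
        span.hasParent || span.isSampledIn
    }.map { $0.name }
}

/// A parent/child pair sharing one sampling decision; the child finishes first.
func trace(sampledIn: Bool) -> [IncomingSpan] {
    [
        IncomingSpan(name: "child", hasParent: true, isSampledIn: sampledIn),
        IncomingSpan(name: "parent", hasParent: false, isSampledIn: sampledIn),
    ]
}

let liveAtRate0 = liveViewNames(for: trace(sampledIn: false))
let liveAtRate100 = liveViewNames(for: trace(sampledIn: true))
print(liveAtRate0)   // ["child"]
print(liveAtRate100) // ["child", "parent"]
```

With everything sampled out, only the child appears in this model, which matches what was observed in the LIVE view at a sample rate of 0.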

I hope it's clear, let me know if you need more info!

shagedorn commented 3 weeks ago

Thanks a lot for digging into this and providing additional details! I will close this issue then; it seems that there's no bug, just a misinterpretation.