DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

Sampling not working in 1.1.0 #2088

Open jorgemanrubia opened 2 years ago

jorgemanrubia commented 2 years ago

I can't get sampling to work. I have tried targeting specific services:

Datadog.configure do |config|
  config.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        Datadog::Tracing::Sampling::SimpleRule.new(service: /^redis.*/, sample_rate: 0.1)
      ]
    )
  )
end

And also with a global one:

Datadog.configure do |config|
   config.tracing.sampler = Datadog::Tracing::Sampling::RateSampler.new(0.1)
end

In the APM ingestion control screen, I can't see any change in the ingestion rates, and all the services remain flagged as "AUTOMATIC". Per the docs, I would expect them to be flagged as "Configured" and to see a 90% reduction for the targeted services. I'm not getting any reduction; it's as if sampling weren't enabled at all.

jorgemanrubia commented 2 years ago

I managed to get the Configured label to appear for Action Cable with:

  config.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        Datadog::Tracing::Sampling::SimpleRule.new(service: "action_cable", sample_rate: 0.1),
        Datadog::Tracing::Sampling::SimpleRule.new(service: /^redis.*/, sample_rate: 0.1)
      ]
    )
  )

But not for the multiple redis-* services we have configured.

When I tried the global sampler referenced above, I didn't get any sampling or configured label for any service, in the ingestion control screen.

jorgemanrubia commented 2 years ago

Another example where sampling is not working. We have a custom service where we trace spans like this:

  Datadog::Tracing.trace "image_preview.generate" do |span, trace|
    span.service = "bc3-image-previews"

In the config:

  config.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        Datadog::Tracing::Sampling::SimpleRule.new(service: "bc3-image-previews", sample_rate: 0.1)
      ]
    )
  )

I didn't notice any ingestion reduction, and it's still showing AUTOMATIC in the ingestion control screen:

[Screenshot: APM ingestion control screen still showing AUTOMATIC (CleanShot 2022-06-16 at 13 28 50@2x)]

delner commented 2 years ago

Hey @jorgemanrubia, my understanding is the RuleSampler targets the attributes of the root span. If you're targeting a child span further down the trace, that may explain why it's not picking it up.

The RateSampler should definitely sample down to your target rate, however. We'll look into that and figure out what's going on. Sorry for the bumps in the road here: 1.0 was a big update for us, and we're working on smoothing things out.
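
For illustration, a sketch of a rule that targets the root span instead ("my-rails-app" is a placeholder for whatever service name your root spans actually carry):

  Datadog.configure do |config|
    config.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
      post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
        [
          # Rules match against the trace's first (root) span, so target
          # the service name carried by that root span.
          Datadog::Tracing::Sampling::SimpleRule.new(service: "my-rails-app", sample_rate: 0.1)
        ]
      )
    )
  end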

delner commented 2 years ago

@marcotc May have some ideas about the sampler?

marcotc commented 2 years ago

@jorgemanrubia, could you try changing your code example to provide the service name as an argument to the #trace call, instead of setting it inside the traced block?

  Datadog::Tracing.trace("image_preview.generate", service = "bc3-image-previews") do |span, trace|

We made some sampling changes in v1.0 to guarantee consistent sampling in distributed scenarios (where your Ruby application calls out to another service also instrumented by Datadog APM). One of these changes ensures the sampling decision is made as early as possible, so that any HTTP requests sent from your Ruby process carry the correct sampling decision in their headers and downstream APM services respect it.

Let us know if this change works for you, and if you have any other concerns.

jorgemanrubia commented 2 years ago

@marcotc thanks for the tip; unfortunately, changing that didn't make a difference. I changed it to:

Datadog::Tracing.trace "image_preview.generate", service: "bc3-image-previews" do |span, trace|
jorgemanrubia commented 2 years ago

It doesn't work for sampling mysql2 either:

  config.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        Datadog::Tracing::Sampling::SimpleRule.new(service: "bc3-image-previews", sample_rate: 0.1),
        Datadog::Tracing::Sampling::SimpleRule.new(service: "mysql2", sample_rate: 0.5)
      ]
    )
  )

For us, sampling not working is the most important ongoing issue, as it makes it very hard to keep ingestion costs under control. We mitigated this by disabling a bunch of services (such as redis), but we don't want to disable everything 😅.

muzfuz commented 2 years ago

We are also having this issue at the moment - I've done some digging around but haven't managed to find any reason why.

Datadog.configure do |c|
  c.tracing.instrument :rails
  c.tracing.instrument :active_record, service_name: "db-service-name"

  c.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        Datadog::Tracing::Sampling::SimpleRule.new(service: "db-service-name", sample_rate: 0.0100)
      ]
    )
  )
end

The configuration appears to be correct, but sampling simply isn't occurring.

delner commented 2 years ago

Hey @jorgemanrubia , @muzfuz. We're discussing this one with the team, figuring out how to best address it.

Setting aside what the RuleSampler is actually doing right now, we'd like to get a little more info from you both about intent, and expectations of what you think it should be doing.

  • Can you explain a bit more why you'd like to sample down these services? (Databases, caches?) Are they noisy?

    • What kinds of spans are worth tuning down? What kinds of spans do you want to keep?
    • Are you actually interested in tuning down the "service" or are you really interested in tuning down the spans from a particular "library"?
  • How are you identifying these services to target? And how are you determining the sample rates?
  • If you tune down a service's sample rate with the RuleSampler, what do you expect to see in the UI?

    1. Would targeted spans only appear on the trace based on the rate? (Thereby making for partial traces?)
    2. Or would the traces that have spans matching these rules appear based on the rate? (Thereby reducing the overall # of traces?)

My hope is with these details we can figure out how to make the RuleSampler do what you need it to, or we can offer something else that better addresses the desired outcome.

tcannonfodder commented 2 years ago

We're struggling with this as well: we're trying to reduce the number of traces ingested for redis, since it's a very stable, low-level service with 1000x the ingestion rate & byte size of other, app-critical parts of the infrastructure. I've spent the weekend trying to reduce the sampling rate, and nothing is working :(

I'm using the code auto-generated by the Datadog setup tool for ingestion rate configuration. See:

  c.tracing.sampler = Datadog::Tracing::Sampling::PrioritySampler.new(
    post_sampler: Datadog::Tracing::Sampling::RuleSampler.new(
      [
        # Sample all 'redis' traces at 0.000001%:
        Datadog::Tracing::Sampling::SimpleRule.new(service: 'redis', sample_rate: 0.000001)
      ]
    )
  )

delner commented 2 years ago

@tcannonfodder Per https://github.com/DataDog/dd-trace-rb/issues/2088#issuecomment-1157759367, the rules do not target child spans like redis, which are not the first span in the trace, so I would not expect this code to work. If it did, it would render traces incomplete (missing some Redis spans, giving the illusion of fewer Redis commands).

Sample rates (at an app level) do work, and will scale down your overall retained volume. Metrics should still be scaled properly. If Redis is of extreme volume (or generally uninteresting), you may want to consider disabling that instrumentation.
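
For reference, turning off an integration looks something like this (a minimal sketch, assuming Redis was instrumented via the usual integration and that the integration's enabled option is available):

  Datadog.configure do |c|
    # Stop producing redis spans entirely; other integrations are unaffected.
    c.tracing.instrument :redis, enabled: false
  end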

brafales commented 1 year ago

> Hey @jorgemanrubia, @muzfuz. We're discussing this one with the team, figuring out how to best address it.
>
> Setting aside what the RuleSampler is actually doing right now, we'd like to get a little more info from you both about intent, and expectations of what you think it should be doing.
>
>   • Can you explain a bit more why you'd like to sample down these services? (Databases, caches?) Are they noisy?
>
>     • What kinds of spans are worth tuning down? What kinds of spans do you want to keep?
>     • Are you actually interested in tuning down the "service" or are you really interested in tuning down the spans from a particular "library"?
>   • How are you identifying these services to target? And how are you determining the sample rates?
>   • If you tune down a service's sample rate with the RuleSampler, what do you expect to see in the UI?
>
>     1. Would targeted spans only appear on the trace based on the rate? (Thereby making for partial traces?)
>     2. Or would the traces that have spans matching these rules appear based on the rate? (Thereby reducing the overall # of traces?)
>
> My hope is with these details we can figure out how to make the RuleSampler do what you need it to, or we can offer something else that better addresses the desired outcome.

Hi @delner, for us the main reason is to control ingestion costs. We have a Rails app being monitored where 75% of the ingested spans are database queries. What we'd like is a way to keep the root spans (e.g. an HTTP request or a Sidekiq job) while, for a % of those traces, not necessarily keeping all the detail in child spans.

I am not sure how that'd be represented in the UI. It'd still be useful to know the % of time spent on the DB (or any other downstream service), because that is what lets us "drill down" (for example, temporarily removing the sampling to investigate with full traces). But most of the time we do not need all the details, and some things (like low-level operations on redis) we don't care about at all.

> • Are you actually interested in tuning down the "service" or are you really interested in tuning down the spans from a particular "library"?

The answer to this question would be the latter. However, the fact that Datadog's APM considers DB calls a service makes it quite confusing.

jorgemanrubia commented 1 year ago

@delner

> Can you explain a bit more why you'd like to sample down these services? (Databases, caches?) Are they noisy?

Yes. More than noisy, they are very expensive to ingest because of their volume. We can't effectively control costs without sampling those.

> What kinds of spans are worth tuning down? What kinds of spans do you want to keep?

Ideally, any kind of span? In our case, SQL traces are a big one, but there are others.

> Are you actually interested in tuning down the "service" or are you really interested in tuning down the spans from a particular "library"?

I think sampling at the span level makes more sense. E.g., we're interested in ingesting all the job executions but not every SQL query they trigger.

> How are you identifying these services to target? And how are you determining the sample rates?

Empirically, by looking at ingestion volume/costs.

> If you tune down a service's sample rate with the RuleSampler, what do you expect to see in the UI?

Ideally, some hint that there are missing spans due to sampling?

> Would targeted spans only appear on the trace based on the rate? (Thereby making for partial traces?) Or would the traces that have spans matching these rules appear based on the rate? (Thereby reducing the overall # of traces?)

I think partial traces make more sense.

For me, it's hard to say what the right solution is here. But there is a clear problem for us: it's impossible to effectively control trace ingestion with the Rails APM integration right now.

delner commented 1 year ago

@jorgemanrubia So if I'm interpreting correctly, the summary of the issue is "too many spans = too expensive" and the general desired solution is "provide sampling controls per span"?

The complexity here in large part involves (1) sampling in distributed tracing and (2) how statistics are accurately compiled and presented in our back end. (1) is a difficult domain problem (given sampling decisions affect downstream services) but (2) may be more easily addressed if we can send/process the right signals regarding what kind of omission has occurred. I think we need to do some more evaluation on our end.

A sampling solution will take more time to implement, for the reasons above. However, in the interim, would having the ability to outright disable certain spans produced by our instrumentation (e.g. omitting all active_record.instantiation spans) be of any value? This could help reduce volume/costs and would be faster for us to turn around.
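
For a rough sketch of what dropping a specific span name can look like with what exists today, the processing pipeline (described further down in this thread) can hard-drop such spans before flush:

  # Hedged sketch: drop every active_record.instantiation span before the
  # trace is flushed, leaving the rest of the trace intact.
  Datadog::Tracing.before_flush(
    Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name == 'active_record.instantiation' }
  )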

delner commented 1 year ago

Also, I neglected to mention that we added single span sampling in 1.5.0. This may actually address your need, @jorgemanrubia: can you check it out and let me know what you think? https://github.com/DataDog/dd-trace-rb/issues/2105#issuecomment-1400810862

danielmklein commented 1 year ago

Just wanted to chime in to add that we are feeling some of the same pain here -- our "noisiest" spans are from things like redis and postgres, which represent the majority of our total ingested bytes but are downstream from, e.g., the root rails controller request span. The main pain point for us is simply monetary: we're hitting on-demand APM ingestion charges, and it doesn't seem like there's an easy way to avoid that without ramping up sampling on our root services.

It doesn't seem like single span sampling addresses this, in my initial testing: it appears to be designed for hoisting spans that would otherwise have been dropped "up" to be ingested, so that metrics etc. get calculated accurately, whereas what we're asking about here is the opposite of that.

I think the biggest point of confusion here stems from the way the tracing config section of the DD web dashboard gladly provides you a code snippet to copy and paste into your app to reduce your ingestion rate for a particular service, without touching on the fact that "well actually yeah this isn't a root service so this code will have no effect" (unless the user really goes digging in the docs, and we all know that we users hate doing that).

jorgemanrubia commented 1 year ago

> It doesn't seem like single span sampling addresses this, in my initial testing: it appears to be designed for hoisting spans that would otherwise have been dropped "up" to be ingested, so that metrics etc. get calculated accurately, whereas what we're asking about here is the opposite of that.

Same experience for me, @delner. I added this to try to reduce mysql2 trace ingestion, right before requiring the ddtrace gem, but it's not reducing mysql2 trace volume at all:

ENV["DD_SPAN_SAMPLING_RULES"] = <<~JSON
[ { "service": "mysql2", "name": "*", "sample_rate": 0.1 } ]
JSON

danielmklein commented 1 year ago

@delner am I correct in thinking that if we use a Datadog::Tracing::Sampling::PrioritySampler with a Datadog::Tracing::Sampling::RuleSampler as its post-sampler and one or more Datadog::Tracing::Sampling::SimpleRules, this will override the behavior whereby all error traces make it through to Datadog? Is there an easy-ish way to use SimpleRules while still letting error traces through?
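
One possible workaround (a sketch, not an official recommendation): force-keep traces that raise errors, using the manual keep override. Datadog::Tracing.keep! marks the active trace as kept regardless of the sampler's decision; error_prone_work is a hypothetical placeholder:

  Datadog::Tracing.trace("jobs.process") do |span, trace|
    begin
      error_prone_work
    rescue
      # Override the sampler's decision: keep this trace.
      Datadog::Tracing.keep!
      raise
    end
  end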

marcotc commented 1 year ago

Sorry for the confusion about sampling, everyone.

Currently, sampling for Datadog APM is only possible at the trace level, with the exception of Single Span Sampling.

At the trace level, you have full control over tracing: you can set percentage rates, apply rate limiting over a time frame, and apply different sampling rates to different services. But all of this operates on the trace graph as a single unit.
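
(For concreteness, a sketch of those trace-level controls, assuming the c.tracing.sampling settings available in recent 1.x releases:)

  Datadog.configure do |c|
    c.tracing.sampling.default_rate = 0.1 # keep ~10% of traces (DD_TRACE_SAMPLE_RATE)
    c.tracing.sampling.rate_limit = 100   # cap at ~100 traces/second (DD_TRACE_RATE_LIMIT)
  end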

The exception, Single Span Sampling, can only be used to keep spans that are part of a trace that was dropped at the trace level. It is meant to retain spans one deems crucial for business decisions while still allowing the trace as a whole to not always be sampled. It cannot be used to drop specific spans from a trace that sampling has kept.

Today, there's no perfect solution for sampling out unwanted spans from a trace, unless removing them altogether is a valid option.

Suggestion

Given the options we have available today, I suggest using the Processing Pipeline to manipulate the trace before it's flushed.

For example, this snippet will remove all Redis spans created by ddtrace.

Datadog::Tracing.before_flush(
  Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name == 'redis.command' },
)

The tracing pipeline also removes any child spans of a removed span, to ensure no orphan spans exist in the trace graph.

Any Ruby logic can go into the { |span| span.name == 'redis.command' } block, so feel free to perform conditional filtering or filter on different span attributes.
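
For instance, a sketch of conditional filtering that removes only ~90% of redis.command spans instead of all of them (the rand threshold here is arbitrary; a truthy return from the block means the span is removed):

  Datadog::Tracing.before_flush(
    Datadog::Tracing::Pipeline::SpanFilter.new do |span|
      # Remove redis.command spans ~90% of the time; keep everything else.
      span.name == 'redis.command' && rand >= 0.1
    end
  )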

The Processing Pipeline will hard-drop spans, meaning no information about them will be propagated downstream. The upside is that this reduces ingestion traffic: no information about the dropped span leaves the Ruby process.

The most noticeable downside is that Trace Metrics for the dropped spans won't be calculated accurately by the backend: only spans that reach the backend can contribute to such metrics. This means that graphs and monitors that directly depend on Trace Metrics for the specific dropped spans won't be reliable. Trace Metrics for the rest of the trace will remain accurate, including total execution time.

jorgemanrubia commented 1 year ago

@marcotc thanks for your suggestion. I tried this to get rid of part of the mysql traces:

  Datadog::Tracing.before_flush(
    Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name != "mysql2.query" || rand < 0.5 },
  )

The effect is that it dramatically dropped traces across all the root services that trigger MySQL queries, which broke a bunch of metrics. Also, the drop was way larger than 50%, which I don't understand. This is for the mysql2 service:

[Screenshot: ingestion volume graph for the mysql2 service (CleanShot 2023-03-17 at 17 48 20@2x)]

We've been suffering from this problem since we started using Datadog, and we now see it as a blocker to continuing with it. I'd like to state our needs as simply and clearly as possible, to make sure I'm not missing anything here:

Is what we want possible?

jorgemanrubia commented 1 year ago

My confusion with Datadog::Tracing::Pipeline::SpanFilter is that the docs are inverted: they say that if the block returns true, the span is kept, when it's actually the opposite:

https://github.com/DataDog/dd-trace-rb/blob/master/lib/datadog/tracing/pipeline/span_filter.rb#L7-L14

delner commented 1 year ago

@jorgemanrubia Sorry for the delay in reply; we're going to take another look here and get back to you shortly. Standby.

delner commented 1 year ago

Regarding:

Datadog::Tracing.before_flush(
  Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name != "mysql2.query" || rand < 0.5 },
)

If the block passed to SpanFilter evaluates to truthy, the span will be dropped. Each span in the trace is evaluated by this block. If a span is dropped, all of its children/grandchildren are dropped as well (every span except a "root" span requires a parent).

Given a trace containing rack.request, rails.controller, and mysql2.query, I would expect the rack.request root span to match span.name != "mysql2.query" (truthy), so it gets dropped, taking the entire trace (rails.controller and mysql2.query included) with it. That would explain both the broken metrics and the much-larger-than-50% drop.

Is it simply a misapplication of SpanFilter? If I haven't miscalculated here, I think you want span.name == 'mysql2.query' && rand < 0.5.
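
If so, a minimal sketch of the corrected filter (drops roughly half of the mysql2.query spans and keeps all other spans):

  Datadog::Tracing.before_flush(
    Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name == "mysql2.query" && rand < 0.5 }
  )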