Open tjwp opened 4 years ago
Echoing the request for explicitly disabling span collection. It'd be great to have a supported way of avoiding trace submission issues with many thousands of spans. (Even having a way of detecting when the limit is reached, for logging purposes, would be helpful.)
We actually have a limit of 100,000 spans per thread (https://github.com/DataDog/dd-trace-rb/blob/a6b48452e9f6c551e557cc0c410f4d4012c79340/lib/ddtrace/context.rb#L26) and the tracer logs a message and emits a metric when the limit is exceeded.
The code that handles the span limit was introduced a long time ago, so I can see it not having the behaviour we desire today: the 100,000 limit is a pretty high barrier, and we'd normally see issues consistently flushing traces with far fewer spans than that; the log message is emitted at the debug level, which is likely not enabled in production for most clients.
In the unlikely case that you think you might be hitting that limit, I suggest enabling health metrics:
Datadog.configure do |c|
  c.diagnostics.health_metrics.enabled = true
end
You'll see a metric emitted under the name datadog.tracer.error.context_overflow, representing how many times a trace longer than 100,000 spans has been created. These metrics are very lightweight: their overhead cannot be measured in our internal testing environments, so they should be safe to enable. (You'll also get other new metrics for your application, mostly about the health of the tracer itself.)
But regarding a way to disable tracing: this is something we've planned on doing before, as the tracer itself needs this in order to send HTTP requests to the agent without those requests being traced. It's good to see there's an external need for it as well.
I'll let you know when we have updates, but this is definitely on our radar.
@marcotc that's helpful, thanks!
100,000 does seem quite high. That said, we did appear to hit some sort of limit with large span counts in certain cases. Traces with low span counts would appear, but traces where we expected large span counts would not. Then, when we manually filtered the spans out (using a special span name that ties back to a ::Datadog::Pipeline.before_flush(::Datadog::Pipeline::SpanFilter.new { ... }) configuration, our current workaround), I don't think we saw "dropped" traces for the relevant code path. But the metric you referenced will help us understand when it might be happening for known reasons at least!
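For anyone reading along, here is a rough sketch of what that name-based filtering amounts to. It is a plain-Ruby model, not the tracer itself: FakeSpan, NOISY_SPAN_NAME, DROP_NOISY and apply_filter are hypothetical names made up for illustration, and the way the predicate would be handed to SpanFilter (shown only in a comment) is an assumption based on the call quoted above, so check it against your ddtrace version.

```ruby
# Stand-in for a tracer span; only the name matters for this sketch.
FakeSpan = Struct.new(:name)

# Hypothetical marker name given to the noisy inner spans.
NOISY_SPAN_NAME = 'bulk.inner_operation'

# Predicate: returns true for spans that should be dropped before flushing.
DROP_NOISY = ->(span) { span.name == NOISY_SPAN_NAME }

# In the application, the predicate would be registered with the tracer,
# assuming the filter block returns true for spans to remove, e.g.:
#   ::Datadog::Pipeline.before_flush(::Datadog::Pipeline::SpanFilter.new(&DROP_NOISY))
# Here we only model the effect on a list of spans.
def apply_filter(spans, predicate)
  spans.reject { |span| predicate.call(span) }
end

# One outer job span plus 10,000 inner spans carrying the marker name.
spans = [FakeSpan.new('sidekiq.job')] +
        Array.new(10_000) { FakeSpan.new(NOISY_SPAN_NAME) }

kept = apply_filter(spans, DROP_NOISY)
puts "kept #{kept.length} of #{spans.length} spans"
```

Note that this only trims spans at flush time; the spans are still created and held in memory while the job runs, so it is a workaround rather than a way to disable collection.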
Is there a way to disable tracing/spans for a specific piece of code?
The use case that I have is we are not seeing traces for some specific Sidekiq jobs. From other metrics I can see that there are some operations, that would generate spans, that are being performed >10K times within the job.
I suspect that we may be hitting the limit on the size of the overall trace, and I'd like to stop collecting those inner >10K spans and instead manually instrument one span that provides the timing around those lines. That span would sit within the overall trace for the Sidekiq job, alongside anything else happening in the job outside that section of code.
Thanks!