DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

Memory bloat when tracing a long-running sidekiq job #2930

Closed cluesque closed 1 year ago

cluesque commented 1 year ago

Current behaviour

We have a sidekiq job that loops over rows in a CSV file, using ActiveRecord to insert records into our database.

When we run this over a file with a million records, the memory consumption of the sidekiq process grows dramatically, sometimes exhausting the available memory and killing the host.

Debugging with ObjectSpace, periodically reporting on the most populous object classes, we find the number of Datadog::Tracing::Span objects growing as the process progresses.
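The periodic check was along these lines (a sketch rather than the exact debugging code; the class filter, limit, and output format are illustrative):

  # Inside the job loop, every N rows: count live objects by class and
  # print the most populous ones, plus the live span count specifically.
  def report_object_counts(limit = 10)
    GC.start
    counts = Hash.new(0)
    ObjectSpace.each_object { |obj| counts[obj.class] += 1 }
    counts.sort_by { |_klass, count| -count }.first(limit).each do |klass, count|
      puts format('%-40s %d', klass, count)
    end
    puts "Datadog::Tracing::Span: #{ObjectSpace.each_object(Datadog::Tracing::Span).count}"
  end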

If we comment out the c.tracing.instrument :sidekiq line in our config/initializers/datadog.rb the memory growth goes away.

Expected behaviour

I'm not totally certain this is unexpected behavior. It might be that ddtrace needs to capture data and can't meaningfully free it until the task is complete.

We'd like to have some way of limiting the memory footprint of instrumentation without disabling the feature altogether.

Is there a way of saying "don't instrument this block" perhaps?

Steps to reproduce

How does ddtrace help you?

Environment

  Datadog.configure do |c|
    c.tracing.instrument :rails, service_name: 'indigo'
    c.tracing.instrument :sidekiq, service_name: 'indigo-sidekiq', client_service_name: 'indigo-sidekiq-client'
    c.tracing.instrument :http, service_name: 'indigo-http' # Traces Net::HTTP calls
    c.tracing.instrument :aws, service_name: 'indigo-aws'
    c.tracing.instrument :rake, service_name: 'indigo-rake'
    c.tracing.instrument :redis, service_name: 'indigo-redis'
    c.tracing.instrument :redis, describes: ENV['indigo_redis_session_url'], service_name: 'indigo-redis-session'
    c.tracing.instrument :redis, describes: ENV['indigo_redis_cache_url'], service_name: 'indigo-redis-cache'
  end
TonyCTHsu commented 1 year ago

👋 @cluesque, do you still observe the memory bloat pattern if you disable Datadog profiling?

Have you tried the partial flushing approach by enabling tracing.partial_flush.enabled?
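
Enabling it is a one-line change in the initializer, roughly (a sketch based on the config you shared above; the rest of the block stays as-is):

  Datadog.configure do |c|
    # Flush finished spans of a long-running trace incrementally, instead of
    # holding every span in memory until the Sidekiq job completes.
    c.tracing.partial_flush.enabled = true
  end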

cluesque commented 1 year ago

@TonyCTHsu Adding c.tracing.partial_flush.enabled = true to our initializer did indeed fix the problem, thanks!

Is there a downside to turning that on? (Is there a reason it's not enabled by default?)

marcotc commented 1 year ago

Is there a downside to turning that on? (Is there a reason it's not enabled by default?)

Partial flush has a slightly increased total overhead, because performing the bulk serialization of a large trace is cheaper than multiple partial serializations of that same trace. The difference shouldn't be too noticeable, though.

Also, in applications with non-trivial sampling rules, the sampling decision can be made late in the trace lifecycle, but partial flushing forces ddtrace to make a sampling decision at flush time. Partially flushed subsets of a trace can therefore end up with a different sampling decision than the final trace, which means it's possible, though unlikely, that your application will send more partial traces to the backend than needed.

But as a general rule, partial flushing works as expected for the vast majority of workflows: we just cannot claim that it is the best solution across the board.

ivoanjo commented 1 year ago

(Since this doesn't appear to be a problem with profiling, I've taken the liberty of updating the ticket title)

peterfication commented 1 year ago

I had the same issue and tracing.partial_flush.enabled = true solved it 🎉