DataDog / dd-trace-rb

Datadog Tracing Ruby Client
https://docs.datadoghq.com/tracing/

gc_count runtime metrics are reported as gauges but contain process counters #3832

Open SpamapS opened 1 month ago

SpamapS commented 1 month ago

Current behaviour

All GC.stat values are reported as gauges.

Expected behaviour

Counters produce graphs like this:

[image]

This isn't terribly useful as a gauge, and should be reported as a count.

This is a bit tricky, as some of the other stats are named *_count but do represent gauges.
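One way to read the request above is "submit the increase since the last flush" rather than the running total. A minimal sketch of that idea, assuming a hypothetical `CounterDelta` helper (this is not part of ddtrace, and the sample readings are made up):

```ruby
# Hypothetical helper: turns a monotonically increasing process counter
# into per-interval deltas that could be submitted as a COUNT.
class CounterDelta
  def initialize
    @last = nil
  end

  # Returns the increase since the previous call (0 on the first call).
  def delta(current)
    previous = @last
    @last = current
    return 0 if previous.nil?

    # A smaller value than before means the process restarted
    # and the counter started again from zero.
    current >= previous ? current - previous : current
  end
end

tracker = CounterDelta.new
samples = [10, 12, 12, 17] # made-up successive GC.count readings
deltas = samples.map { |value| tracker.delta(value) }
# deltas == [0, 2, 0, 5]

# The real counters only ever increase within one process:
gc_total = GC.count
major_total = GC.stat(:major_gc_count)
```

Stats that really are point-in-time values would stay as gauges; only the ever-increasing ones would go through something like this.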

Steps to reproduce

Just enable Ruby runtime metrics and graph runtime.ruby.gc.gc_count or runtime.ruby.gc.major_gc_count.

Environment

ddtrace version: 1.23.3

  require 'ddtrace'
  ::Datadog.configure do |c|
    c.agent.host = ENV["DDTRACE_HOST"]

    c.profiling.enabled = Settings.datadog.profiling_enabled
    c.tracing.log_injection = false

    c.tracing.instrument :rails, service_name: "x-app"
    c.tracing.instrument :active_support, cache_service: "x-cache"
    c.tracing.instrument :action_pack, service_name: "x-controller"
    c.tracing.instrument :active_model_serializers
    c.tracing.instrument :active_record, service_name: "x-postgres"
    c.tracing.instrument :pg, service_name: "postgres", comment_propagation: 'full'

    c.tracing.instrument :redis # service_name defaults to "redis"

    c.tracing.instrument :http # net/http
    c.tracing.instrument :rest_client

    c.runtime_metrics.enabled = true
    c.runtime_metrics.statsd = TFE::Clients.rtstatsd
  end

Ruby version: 3.1.5

Operating system: Linux (various) & macOS

marcotc commented 1 month ago

Hey @SpamapS, we can't use "count" for GC.count because a COUNT metric sums every value submitted during a flush interval, instead of recording only the latest value.

From our metrics docs: https://docs.datadoghq.com/metrics/types/?tab=count#metric-types

Suppose you are submitting a COUNT metric, notifications.sent, from a single host running the Datadog Agent. This host emits the following values in a flush time interval: [1,1,1,2,2,2,3,3].

The Agent adds all of the values received in one time interval. Then, it submits the total number, in this case 15, as the COUNT metric’s value.

Because GC.count will report the total number of GC cycles, we want it to be reported as a gauge.
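To make the difference concrete, here is a rough sketch in plain Ruby. It only mimics the aggregation described in the docs excerpt; `aggregate_count` and `aggregate_gauge` are hypothetical helpers, not the Agent's actual code, and the GC.count readings are made up.

```ruby
# Values emitted by one host during a single flush interval
# (the example from the metrics docs).
values = [1, 1, 1, 2, 2, 2, 3, 3]

# COUNT: every value received in the interval is added together.
def aggregate_count(values)
  values.sum
end

# GAUGE: only the most recent value in the interval is kept.
def aggregate_gauge(values)
  values.last
end

count_result = aggregate_count(values) # 15
gauge_result = aggregate_gauge(values) # 3

# The same thing with made-up GC.count readings, which are running totals:
gc_readings = [100, 100, 101]
gc_as_count = aggregate_count(gc_readings) # 301, not a meaningful number
gc_as_gauge = aggregate_gauge(gc_readings) # 101, the actual total so far
```

Submitting the running total as a COUNT would add the totals to each other, which is why the raw value of GC.count only makes sense as a gauge.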

Now, regarding the graph jumps you are seeing, these are likely caused by multiple Ruby processes with the same service name. The Runtime Metrics aggregate at the service level, not per individual process, which causes such metrics to report inconsistent values. We are actively working on a solution as we speak.
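A small simulation of why that produces jumps, assuming two made-up processes that report their own GC.count into a single series. The `pid:a` / `pid:b` keys are purely illustrative and do not describe the fix being worked on.

```ruby
# Two hypothetical processes of the same service, each with its own
# ever-increasing GC.count.
process_a = [100, 101, 102]
process_b = [5, 6, 7]

# With nothing to tell the processes apart, the samples land in one
# series in arrival order.
merged = process_a.zip(process_b).flatten
# merged == [100, 5, 101, 6, 102, 7]

# Checks whether a series ever goes down.
def monotonic?(series)
  series.each_cons(2).all? { |earlier, later| later >= earlier }
end

merged_ok = monotonic?(merged) # false: the graph jumps up and down

# Kept apart per process (illustrative keys), each series is well behaved.
per_process = { "pid:a" => process_a, "pid:b" => process_b }
separate_ok = per_process.values.all? { |series| monotonic?(series) } # true
```

Each process on its own reports a steadily growing value; the sawtooth only appears once the two are merged under one service name.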