gc_count runtime metrics are reported as gauges but contain process counters

DataDog / dd-trace-rb

Datadog Tracing Ruby Client

Current behaviour

All GC.stat values are reported as gauges.

Expected behaviour

Counters produce graphs like this:

This isn't terribly useful as a gauge, and should be reported as a count.

This is a bit tricky, as some of the other stats have names ending in _count but actually represent gauges.
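To illustrate what reporting "as a count" could look like, here is a minimal sketch that turns a cumulative reading such as GC.stat(:count) into per-interval increases. The CounterDelta class and the :gc_count key are hypothetical names made up for this example; they are not part of dd-trace-rb, and this is not necessarily how the library would implement it.

```ruby
# Hypothetical helper (not part of dd-trace-rb): turns a cumulative,
# monotonically increasing reading into per-interval deltas.
class CounterDelta
  def initialize
    @previous = {}
  end

  # Returns the increase since the last call for +key+, or 0 on the first
  # call, because there is no earlier sample to compare against.
  def delta(key, current)
    previous = @previous[key]
    @previous[key] = current
    return 0 if previous.nil?

    # A smaller reading means the counter was reset (for example after a
    # process restart), so the current reading is treated as the increase.
    current >= previous ? current - previous : current
  end
end

tracker = CounterDelta.new
tracker.delta(:gc_count, GC.stat(:count)) # first sample: baseline only
GC.start                                  # force at least one collection
gc_runs = tracker.delta(:gc_count, GC.stat(:count))
puts "GC runs since last sample: #{gc_runs}"
```

Submitting gc_runs (rather than the running total) is what would make a count-type metric meaningful here; stats that go up and down, such as live slot counts, would still be sent as plain gauges.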

Steps to reproduce

Just enable Ruby runtime metrics and graph runtime.ruby.gc.gc_count or runtime.ruby.gc.major_gc_count.

Environment

datadog version:

1.23.3

Configuration block (Datadog.configure ...):

  require 'ddtrace'
  ::Datadog.configure do |c|
    c.agent.host = ENV["DDTRACE_HOST"]

    c.profiling.enabled = Settings.datadog.profiling_enabled
    c.tracing.log_injection = false

    c.tracing.instrument :rails, service_name: "x-app"
    c.tracing.instrument :active_support, cache_service: "x-cache"
    c.tracing.instrument :action_pack, service_name: "x-controller"
    c.tracing.instrument :active_model_serializers
    c.tracing.instrument :active_record, service_name: "x-postgres"
    c.tracing.instrument :pg, service_name: "postgres", comment_propagation: 'full'

    c.tracing.instrument :redis # service_name defaults to "redis"

    c.tracing.instrument :http #net/http
    c.tracing.instrument :rest_client

    c.runtime_metrics.enabled = true
    c.runtime_metrics.statsd = TFE::Clients.rtstatsd
  end

Ruby version:

3.1.5

Operating system:

Linux (various) & MacOS

Relevant library versions:

Hey @SpamapS, we can't use "count" for GC.count because a COUNT metric sums every value submitted during a flush interval instead of recording only the latest value, so each report of the running total would be added on top of the previous ones.

From our metrics docs: https://docs.datadoghq.com/metrics/types/?tab=count#metric-types

Suppose you are submitting a COUNT metric, notifications.sent, from a single host running the Datadog Agent. This host emits the following values in a flush time interval: [1,1,1,2,2,2,3,3].

The Agent adds all of the values received in one time interval. Then, it submits the total number, in this case 15, as the COUNT metric’s value.
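The arithmetic in the docs example can be sketched as follows. This is a simplified model of the aggregation described above, not the Agent's actual code; flush_as_count, flush_as_gauge and the sample GC readings are made up for illustration.

```ruby
# Simplified model of one flush interval (not the Agent's real implementation).

# COUNT: every value received in the interval is added together.
def flush_as_count(values)
  values.sum
end

# GAUGE: only the last value received in the interval is kept.
def flush_as_gauge(values)
  values.last
end

# The example from the metrics docs.
docs_values = [1, 1, 1, 2, 2, 2, 3, 3]
puts flush_as_count(docs_values) # 15
puts flush_as_gauge(docs_values) # 3

# Hypothetical cumulative GC.count readings sent four times in one interval.
gc_readings = [100, 100, 101, 101]
puts flush_as_count(gc_readings) # 402, which is not a meaningful number of GC runs
puts flush_as_gauge(gc_readings) # 101, the latest running total
```

Submitting the raw running total as a COUNT therefore inflates the value, which is why the total is sent as a gauge unless it is first converted to per-interval increases.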

Because GC.count will report the total number of GC cycles, we want it to be reported as a gauge.

Now, regarding the graph jumps you are seeing: these are likely caused by multiple Ruby processes reporting under the same service name. Runtime Metrics are aggregated at the service level, not per individual process, so such metrics report inconsistent values. We are actively working on a solution.
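A small simulation shows why several processes sharing one series would produce jumps. The process IDs and readings below are invented, and grouping by process is shown only to make the effect visible; it is not meant to describe the solution being worked on.

```ruby
# Each sample is one gauge submission of a process's cumulative GC count.
Sample = Struct.new(:pid, :gc_count)

# Two hypothetical processes of the same service, reporting in turn.
samples = [
  Sample.new(101, 500),
  Sample.new(202, 20),
  Sample.new(101, 502),
  Sample.new(202, 21)
]

# True when a series never decreases, as a cumulative counter should.
def monotonic?(series)
  series.each_cons(2).all? { |a, b| a <= b }
end

# Aggregated per service: both processes write into the same series.
merged = samples.map(&:gc_count)
puts merged.inspect     # [500, 20, 502, 21]
puts monotonic?(merged) # false, the graph jumps up and down

# Split per process: each series is a clean, increasing running total.
per_process = samples.group_by(&:pid).transform_values { |s| s.map(&:gc_count) }
per_process.each do |pid, series|
  puts "#{pid}: #{series.inspect} monotonic=#{monotonic?(series)}"
end
```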

gc_count runtime metrics are reported as gauges but contain process counters #3832