discourse / prometheus_exporter

A framework for collecting and aggregating prometheus metrics
MIT License

Sidekiq process metrics are not being cleared out #261

Closed sosedoff closed 1 year ago

sosedoff commented 1 year ago

I'm running my app in containers and doing frequent deploys (the cloud platform doesn't matter). Sidekiq processes are scaled up and down based on queue size, which means Sidekiq will produce a bunch of metrics with different hostname/identity tags. I did notice that a bunch of old metrics are still being reported on the prometheus_exporter metrics endpoint, though I'm not entirely sure whether this is intentional.

I thought it was something to do with the metrics collection in my setup, so I decided to replicate it locally:

  1. Start the prometheus_exporter process
  2. Start the Rails server to get some basic web metrics in, and schedule a handful of Sidekiq jobs
  3. Start Sidekiq; I used around 5 separate worker processes (each running 5 worker threads)
  4. Let the Sidekiq workers process jobs for a few minutes
  5. Check metrics
ruby_sidekiq_process_busy{labels="",queues="default",quiet="false",tag="superapp",hostname="dans-mbp14.lan",identity="dans-mbp14.lan:3347:a2992973bcff"} 5
ruby_sidekiq_process_busy{labels="",queues="default",quiet="false",tag="superapp",hostname="dans-mbp14.lan",identity="dans-mbp14.lan:3939:6ff3ccf91542"} 5
ruby_sidekiq_process_busy{labels="",queues="default",quiet="false",tag="superapp",hostname="dans-mbp14.lan",identity="dans-mbp14.lan:7298:1a91f2ff7eeb"} 5
ruby_sidekiq_process_busy{labels="",queues="default",quiet="false",tag="superapp",hostname="dans-mbp14.lan",identity="dans-mbp14.lan:1311:14ad6d54354e"} 5
ruby_sidekiq_process_busy{labels="",queues="default",quiet="false",tag="superapp",hostname="dans-mbp14.lan",identity="dans-mbp14.lan:8404:110645868ccf"} 5

ruby_sidekiq_stats_processes_size 5
ruby_sidekiq_stats_workers_size 25

That looks right: all my Sidekiq processes are busy. Let's kill all but one, so we still get Sidekiq process metrics in and simulate a scale-down event, then wait a few minutes for the metrics to update.

Now, checking the metrics endpoint again, I still see the old worker stats, but ruby_sidekiq_stats_processes_size is reporting the correct process count of 1.

Are the stale metrics supposed to be GC'ed? Otherwise it's confusing to see the old values stick around, especially if such metrics are used to drive scaling decisions.
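
For context, here is a minimal sketch of why the dead processes keep showing up: a Prometheus gauge family keeps one value per unique label set, so a series for a worker that no longer exists stays at its last observed value until something removes it. The `TinyGauge` class below is a hypothetical simplification for illustration, not the library's actual gauge class:

```ruby
# Hypothetical minimal gauge: stores one value per unique label set,
# mirroring how a Prometheus gauge family behaves. Nothing here ever
# drops a series on its own, which is why dead workers linger.
class TinyGauge
  def initialize(name)
    @name = name
    @data = {} # labels hash => last observed value
  end

  def observe(value, labels = {})
    @data[labels] = value
  end

  def reset!
    @data = {}
  end

  def to_prometheus_text
    @data.map do |labels, value|
      label_str = labels.map { |k, v| "#{k}=\"#{v}\"" }.join(",")
      "#{@name}{#{label_str}} #{value}"
    end.join("\n")
  end
end

gauge = TinyGauge.new("ruby_sidekiq_process_busy")
gauge.observe(5, identity: "host:3347:a2992973bcff")
gauge.observe(5, identity: "host:3939:6ff3ccf91542")

# Even if process 3939 dies, nothing removes its series; the endpoint
# keeps reporting both values until the gauge is reset or replaced.
puts gauge.to_prometheus_text
```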

SamSaffron commented 1 year ago

certainly looks like a bug to me; we are missing a GC and a custom collector for the metrics.

@mrsimo any thoughts here?

sosedoff commented 1 year ago

It looks like metrics expiration is only checked at ingest time, per https://github.com/discourse/prometheus_exporter/blob/main/lib/prometheus_exporter/server/sidekiq_process_collector.rb#L42

So per https://github.com/discourse/prometheus_exporter/blob/main/lib/prometheus_exporter/server/sidekiq_process_collector.rb#L24 we can still be serving stale metrics.
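
A minimal sketch of that gap (the class and method names here are simplified for illustration, not the library's actual collector): since stale entries are only pruned inside `collect`, a scale-down that stops new payloads from arriving means `metrics` keeps serving whatever was last ingested, forever:

```ruby
MAX_AGE = 60 # seconds; stands in for MAX_SIDEKIQ_METRIC_AGE

# Hypothetical simplified collector where expiry happens only at
# ingest time, reproducing the reported behaviour.
class IngestOnlyCollector
  def initialize
    @entries = []
  end

  # Called when a worker reports in. This is the ONLY place stale
  # entries are dropped, so pruning depends on fresh ingests arriving.
  def collect(payload, now)
    @entries.delete_if { |e| e[:created_at] + MAX_AGE < now }
    @entries << payload.merge(created_at: now)
  end

  # Called when /metrics is scraped. No expiry check here, so a scrape
  # long after the last ingest still sees every old entry.
  def metrics
    @entries
  end
end

c = IngestOnlyCollector.new
c.collect({ identity: "a" }, 0)
c.collect({ identity: "b" }, 0)

# Both workers die; 10 minutes pass with no new ingest.
# A scrape at t=600 still returns both stale entries.
c.metrics.size # => 2
```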

sosedoff commented 1 year ago

After further investigation, it appears there are a few issues.

Here's a potential fix for clearing out the stale metrics data:

diff --git a/lib/prometheus_exporter/server/sidekiq_process_collector.rb b/lib/prometheus_exporter/server/sidekiq_process_collector.rb
index 0180df9..599fb47 100644
--- a/lib/prometheus_exporter/server/sidekiq_process_collector.rb
+++ b/lib/prometheus_exporter/server/sidekiq_process_collector.rb
@@ -21,8 +21,15 @@ module PrometheusExporter::Server
     end

     def metrics
+      clear_stale_metrics
+
+      SIDEKIQ_PROCESS_GAUGES.map do |name, help|
+        gauges[name]&.reset!
+      end
+
       sidekiq_metrics.map do |metric|
         labels = metric.fetch('labels', {})
+
         SIDEKIQ_PROCESS_GAUGES.map do |name, help|
           if (value = metric[name])
             gauge = gauges[name] ||= PrometheusExporter::Metric::Gauge.new("sidekiq_process_#{name}", help)
@@ -37,10 +44,20 @@ module PrometheusExporter::Server
     def collect(object)
       now = ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
       process = object['process']
-
       process["created_at"] = now
-      sidekiq_metrics.delete_if { |metric| metric['created_at'] + MAX_SIDEKIQ_METRIC_AGE < now }
+
+      clear_stale_metrics(now)
       sidekiq_metrics << process
     end
+
+    private
+
+    def clear_stale_metrics(now_ts = nil)
+      now_ts ||= ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
+
+      sidekiq_metrics.delete_if do |metric|
+        metric['created_at'] + MAX_SIDEKIQ_METRIC_AGE < now_ts
+      end
+    end
   end
 end

While the stale data is being reported for the process stats, I noticed the same happens with the ruby_sidekiq_stats_* metrics: I don't have any Sidekiq processes running but still get these values:

ruby_sidekiq_stats_processes_size 1
ruby_sidekiq_stats_workers_size 5
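
Presumably the same expire-before-serve idea would apply to the stats metrics; a hedged sketch under that assumption (class name, constant, and shape are illustrative, not the library's actual stats collector):

```ruby
STATS_MAX_AGE = 60 # seconds; illustrative, not the library's constant

# Hypothetical stats collector that drops its snapshot once it goes
# stale, so ruby_sidekiq_stats_* stop being reported when no Sidekiq
# process has checked in recently.
class ExpiringStatsCollector
  def initialize
    @stats = nil
    @updated_at = nil
  end

  # Record the latest stats snapshot and when it arrived.
  def collect(stats, now)
    @stats = stats
    @updated_at = now
  end

  # Serve the snapshot only while it is fresh; otherwise serve nothing.
  def metrics(now)
    return [] if @stats.nil? || @updated_at + STATS_MAX_AGE < now
    [@stats]
  end
end

c = ExpiringStatsCollector.new
c.collect({ processes_size: 1, workers_size: 5 }, 0)
c.metrics(30)  # fresh: the snapshot is served
c.metrics(600) # stale: nothing is served
```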

I'll try to cut a new PR with bugfixes for these sidekiq metrics tomorrow.

SamSaffron commented 1 year ago

oh awesome, thanks @sosedoff