Certainly looks like a bug to me; we are missing a GC in the custom collector for the metrics.
@mrsimo any thoughts here?
It looks like the metrics expiration is only checked at ingest time, per https://github.com/discourse/prometheus_exporter/blob/main/lib/prometheus_exporter/server/sidekiq_process_collector.rb#L42
So, per https://github.com/discourse/prometheus_exporter/blob/main/lib/prometheus_exporter/server/sidekiq_process_collector.rb#L24, we can still be serving stale metrics.
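For context, here is a condensed sketch of the current flow (paraphrased from the linked file, not a verbatim copy): stale entries are only dropped when a new payload arrives, never when the endpoint is scraped.

```ruby
# Condensed sketch of SidekiqProcessCollector behavior (paraphrased).

def collect(object)
  now = ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
  process = object['process']
  process['created_at'] = now

  # Expiry only happens here, i.e. when a new payload is ingested:
  sidekiq_metrics.delete_if { |metric| metric['created_at'] + MAX_SIDEKIQ_METRIC_AGE < now }
  sidekiq_metrics << process
end

def metrics
  # No expiry check here: if no process has reported in a while,
  # whatever is left in sidekiq_metrics (and in the gauges) keeps
  # being rendered on every scrape.
  sidekiq_metrics.map do |metric|
    # ... build/observe gauges from each stored entry ...
  end
  gauges.values
end
```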
After further investigation, it appears that we've got a few issues:

- Stale entries are only expired at ingest time (`collect`), never when the metrics are served.
- Gauges are never reset in the `metrics` call in the `SidekiqProcessCollector` class, which means old data is stuck around until the process is restarted.

Here's a potential fix for clearing out stale metrics data:
```diff
diff --git a/lib/prometheus_exporter/server/sidekiq_process_collector.rb b/lib/prometheus_exporter/server/sidekiq_process_collector.rb
index 0180df9..599fb47 100644
--- a/lib/prometheus_exporter/server/sidekiq_process_collector.rb
+++ b/lib/prometheus_exporter/server/sidekiq_process_collector.rb
@@ -21,8 +21,15 @@ module PrometheusExporter::Server
     end
 
     def metrics
+      clear_stale_metrics
+
+      SIDEKIQ_PROCESS_GAUGES.map do |name, help|
+        gauges[name]&.reset!
+      end
+
       sidekiq_metrics.map do |metric|
         labels = metric.fetch('labels', {})
+
         SIDEKIQ_PROCESS_GAUGES.map do |name, help|
           if (value = metric[name])
             gauge = gauges[name] ||= PrometheusExporter::Metric::Gauge.new("sidekiq_process_#{name}", help)
@@ -37,10 +44,20 @@ module PrometheusExporter::Server
     def collect(object)
       now = ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
       process = object['process']
-
       process["created_at"] = now
-      sidekiq_metrics.delete_if { |metric| metric['created_at'] + MAX_SIDEKIQ_METRIC_AGE < now }
+
+      clear_stale_metrics(now)
       sidekiq_metrics << process
     end
+
+    private
+
+    def clear_stale_metrics(now_ts = nil)
+      now_ts ||= ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
+
+      sidekiq_metrics.delete_if do |metric|
+        metric['created_at'] + MAX_SIDEKIQ_METRIC_AGE < now_ts
+      end
+    end
   end
 end
```
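A quick way to sanity-check the patch (a hypothetical script, assuming `MAX_SIDEKIQ_METRIC_AGE` stays at its current 60 seconds):

```ruby
require 'prometheus_exporter/server'

collector = PrometheusExporter::Server::SidekiqProcessCollector.new

# Simulate one Sidekiq process reporting in.
collector.collect(
  'process' => {
    'busy' => 2,
    'concurrency' => 5,
    'labels' => { 'hostname' => 'worker-1' }
  }
)
puts collector.metrics.map(&:to_prometheus_text) # worker-1 gauges are present

# Simulate the process going away: once the entry is older than
# MAX_SIDEKIQ_METRIC_AGE, a later scrape should no longer report it.
sleep 61

# With the patch the stale entry is dropped and the gauges are reset,
# so no worker-1 samples are rendered; without it, the old values keep
# showing up on every scrape.
puts collector.metrics.map(&:to_prometheus_text)
```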
While the stale data is being reported for the proc stats, I did notice the same happens with `ruby_sidekiq_stats_*` metrics: I don't have any Sidekiq processes running but still get these values:

```
ruby_sidekiq_stats_processes_size 1
ruby_sidekiq_stats_workers_size 5
```
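For anyone reproducing this, a simple way to inspect what's being served (assuming the exporter runs locally on its default port 9394):

```ruby
require 'net/http'

# Dump all sidekiq-related series from the local exporter.
body = Net::HTTP.get(URI('http://localhost:9394/metrics'))
puts body.lines.grep(/sidekiq/)
```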
I'll try to cut a new PR with bugfixes for these sidekiq metrics tomorrow.
oh awesome, thanks @sosedoff
I'm running my app in containers and do frequent deploys (cloud platform does not matter). Sidekiq processes are being scaled up and down based on the queue size, which means Sidekiq will produce a bunch of metrics with different `hostname`/`identity` tags. I did notice that a bunch of old metrics are still being reported in the `prometheus_exporter` metrics endpoint, though I'm not entirely sure if this is intentional or not.

Thought it was something to do with metrics collection in my setup, so I decided to replicate the setup locally:
- Start a `prometheus_exporter` process
- Start a `rails server` to get some basic web metrics in, and schedule a handful of Sidekiq jobs (a sketch of the kind of job I mean follows below)
- Start `sidekiq`; I've used around 5 separate worker processes (each running 5 worker threads)

That looks right, all my Sidekiq processes are busy. Let's kill all but one, so we can still get Sidekiq proc metrics in and simulate a scale-down event. Wait a few minutes to get the metrics updated.
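As for scheduling jobs (step two above), the exact workload doesn't matter; a hypothetical worker like this (class name and sleep duration made up) is enough to keep all threads busy:

```ruby
# Hypothetical job used only to occupy worker threads.
class BusyJob
  include Sidekiq::Worker

  def perform
    sleep 30
  end
end

100.times { BusyJob.perform_async }
```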
Now, checking the metrics endpoint again, I still see the old worker stats, but `ruby_sidekiq_stats_processes_size` is reporting the correct process count (`1`).

Are the stale metrics supposed to be GC'ed? Otherwise it's a bit confusing to see the old values stuck around, especially if such metrics are used to determine scaling events.