discourse / prometheus_exporter

A framework for collecting and aggregating prometheus metrics
MIT License
538 stars 156 forks source link

active_record_connection_pool_waiting always returns 0 #242

Open bogn83 opened 2 years ago

bogn83 commented 2 years ago

The query for it always returns 0 even though we can see lots of

ActiveRecord::ConnectionTimeoutError
could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use

even fully unfiltered by labels and in a 30d time window:

ruby_active_record_connection_pool_waiting > 0

while

ruby_active_record_connection_pool_idle > 0

works.

slipperywizard commented 2 years ago

+1

modulitos commented 2 years ago

I believe this metric is working correctly.

I had the same issue as the OP, and suspect it wasn't working as well. But digging deeper, I realized that this metric isn't likely to get scraped by Prometheus unless the system is under sustained load. Since Prometheus only scrapes metrics once in a while (default is 10 seconds), it's possible to get a short spike in load, raise an ActiveRecord::ConnectionTimeoutError, and have ruby_active_record_connection_pool_waiting return to 0 when scraped. So it's a race condition. For this metric to reliably report a value above 0, the system must be under load long enough for the threads waiting in the queue to get scraped. After running a load test, I can confirm that it incremented correctly.

We can see how this works in the code. This metric is set by calling the .stat method on the connection pool: https://github.com/discourse/prometheus_exporter/blob/8c8f4a58268c7b7ca3b80b4564ca62997c191624/lib/prometheus_exporter/instrumentation/active_record.rb#L60

and here's the .stat method: https://github.com/rails/rails/blob/eed417637b594b33b3b58b08dd41daa10c57e083/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L703-L719 (Referenced AR version is 6.0.5.1)

we can see that the waiting metric is set to num_waiting_in_queue. Tracing through that code, we can see that it's a gauge for the number of threads waiting on a connection, which gets incremented/decremented in this loop: https://github.com/rails/rails/blob/eed417637b594b33b3b58b08dd41daa10c57e083/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L203-L226

The above code snippet should convince us that the number of waiting threads is incremented and reported to the waiting metric, and that it's incremented before the ConnectionTimeoutError error is raised. This all happens within fractions of a second, so unless there's a sustained load, it might not get scraped by Prometheus.