discourse / prometheus_exporter

A framework for collecting and aggregating prometheus metrics
MIT License
532 stars 154 forks source link

Fix: puma thread/memory leak, add thread stop logic to puma instrumentation #218

Closed zhm closed 2 years ago

zhm commented 2 years ago

The README for puma recommends using this code to initialize puma instrumentation:

after_worker_boot do
  require 'prometheus_exporter/instrumentation'

  PrometheusExporter::Instrumentation::Puma.start
end

This results in a puma monitor thread being spawned per worker from the main process. When combined with puma-worker-killer this results in an accumulation of leaked threads and memory. The problematic one is PrometheusExporter::Instrumentation::Puma.start, which unconditionally spawns a new thread.

This PR adds the same stop / kill logic from the ActiveRecord and Process instrumentation to the Puma instrumentation.

When we add this fix:

memory

SamSaffron commented 2 years ago

Yikes ... cc @nateberkopec

Let's start at the beginning here though. What is the correct way of spawning this thread only once per process? Starting and stopping a thread 30 times does not feel ideal.

At a minimum ... perhaps ... if the thread is already started we don't spawn a new one ... don't stop the old one?

nateberkopec commented 2 years ago

Yeah, as we say in the comments for after_worker_fork (alias of after_worker_boot), emphasis mine:

Code to run in the master after a worker has been started This is called everytime a worker is to be started.

So code run in this hook should be idempotent. I'm not sure how this gem works but if you can run in the workers rather than the master process, you could hook in to on_worker_boot(&block) instead.

SamSaffron commented 2 years ago

I see... @zhm how does this look?

https://github.com/discourse/prometheus_exporter/pull/219

SamSaffron commented 2 years ago

@zhm closing this in favor of #219 . It applies the fix more widely.

Thank you so much for bringing this up!

zhm commented 2 years ago

@SamSaffron this is fantastic, #219 is a much more comprehensive solution!