discourse / prometheus_exporter

A framework for collecting and aggregating prometheus metrics
MIT License
532 stars 154 forks source link

FIX: stop raising NoMethodError when processing unregistered types #209

Open hx opened 2 years ago

hx commented 2 years ago

We've found a lot of log noise comes from PrometheusExporter::Server::Collector#process_hash whenever #register_metric_unsafe returns nil. We recognise there are two problems there;

  1. Our collectors aren't registering properly. We're working on this.
  2. There's dissonance between the two aforementioned functions, with one not being prepared for the other to return nil.

This PR suggests a remedy for the latter.

It also raises the broader question of how this scenario could be better handled, since we've somewhat clumsily managed to demonstrate that it can occur in production code. A quick search shows 5 usages of STDERR.puts, which would ideally be handleable somehow. Some ideas would be;

I'm happy to discuss options and draft alternate PRs if this one isn't the way we want to go. Someone else in my organisation did our Prometheus implementations, so I'm pretty unfamiliar with its ecosystem and with this gem, and I'm happy to be corrected if I've missed anything.

Thanks everyone!

SamSaffron commented 2 years ago

I feel like an exception though is correct, you asked the method to do something and it was impossible for it to register the metric.

I also agree the exception you get back is not helpful at all, perhaps the first change here is to re-wrap the exception and raise something more helpful?

hx commented 2 years ago

I also agree the exception you get back is not helpful at all, perhaps the first change here is to re-wrap the exception and raise something more helpful?

The real question here is context. The exception occurs in a server handler thread, in a call stack that doesn't allow gem consumers to implement an exception handler. The gem owns the server thread, so the gem should be handling the exception. Which means the error handling mechanism is a function of the desired behaviour.

So what's the desired behaviour? I don't think crashing the server thread would be helpful. We've found these errors only occur in production, so it doesn't fit the typical pattern of letting the unhandled exceptions fail a test suite. It's also a breaking change that I imagine few would appreciate.

I suggest delegating error handling to consumers. It's a little more flexible than letting them set a logger, because the default could maintain the existing behaviour. In the abstract:

class Server
  def initialize
    @on_error = -> e { STDERR.puts e.message }
  end

  def on_error(&block)
    @on_error = block
  end
end

…

server = Server.new(…)
server.on_error { |e| Rails.logger.warn "Prometheus exporter error: #{e.message}" }

Internally, this could be implemented with a simple generic exception class (e.g. PrometheusExporter::Server::Error) that's rescued in the mount_proc block and calls @on_error.call(e).

Let me know if you'd like to explore this approach, and I'll rewrite this PR. When we're happy with the design, I'll sort out whatever's going on with Ruby 3.

Thanks @SamSaffron !

SamSaffron commented 2 years ago

I see, yes totally, happy with the change as long as you add some interface for the consumers to get at the exceptions. Please go ahead and explore the approach, also with the PR I would love it if you can update the readme to reflect the change!