akoutmos / prom_ex

An Elixir Prometheus metrics collection library built on top of Telemetry with accompanying Grafana dashboards
MIT License

[BUG] Polling of metrics in custom plugins stops if an error is raised inside the mfa function for the metric #236

Open fedme opened 1 month ago

fedme commented 1 month ago

**Describe the bug**
Polling of metrics in custom plugins stops if an error is raised inside the mfa function for the metric.

**To Reproduce**
Steps to reproduce the behavior:

  1. Clone this example repository: https://github.com/fedme/prom_ex_issue
     The sample application defines a custom PromEx plugin here: https://github.com/fedme/prom_ex_issue/blob/main/lib/prom_ex_issue/custom_prom_ex_plugin.ex

  2. Start the sample application with `mix phx.server` and look at the logs in the terminal.

  3. Observe the logs showing that the mfa function for the metric is called at every polling interval. You should see the following output:

    
    ######################################################################
    MFA execute_ping_metrics called for the 1 time.
    ######################################################################

    ######################################################################
    MFA execute_ping_metrics called for the 2 time.
    ######################################################################

    [...]


  4. The plugin is written so that the mfa function raises an error the 6th time it is polled. You should see something like the following output in the console:

    [...]

    ######################################################################
    MFA execute_ping_metrics called for the 4 time.
    ######################################################################

    ######################################################################
    MFA execute_ping_metrics called for the 5 time.
    ######################################################################

    [error] Error when calling MFA defined by measurement: PromExIssue.CustomPromExPlugin :execute_ping_metrics [#PID<0.676.0>]
    Class=:error
    Reason=%RuntimeError{
      message: "Something is not working correctly, I can't return the metrics right now!"
    }
    Stacktrace=[
      {PromExIssue.CustomPromExPlugin, :execute_ping_metrics, 1,
       [
         file: ~c"lib/prom_ex_issue/custom_prom_ex_plugin.ex",
         line: 48,
         error_info: %{module: Exception}
       ]},
      {:telemetry_poller, :make_measurement, 1,
       [
         file: ~c"/Users/fedme/code/prom_ex_issue/deps/telemetry_poller/src/telemetry_poller.erl",
         line: 336
       ]},
      {:telemetry_poller, :"-make_measurements_and_filter_misbehaving/1-lc$^0/1-0-", 1,
       [
         file: ~c"/Users/fedme/code/prom_ex_issue/deps/telemetry_poller/src/telemetry_poller.erl",
         line: 332
       ]},
      {:telemetry_poller, :handle_info, 2,
       [
         file: ~c"/Users/fedme/code/prom_ex_issue/deps/telemetry_poller/src/telemetry_poller.erl",
         line: 354
       ]},
      {:gen_server, :try_handle_info, 3, [file: ~c"gen_server.erl", line: 1095]},
      {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 1183]},
      {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 241]}
    ]


  5. Notice that the metric is no longer polled after the exception: no further log lines appear in the console to show that the function is being called.
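
The stacktrace passes through a `telemetry_poller` function named `make_measurements_and_filter_misbehaving`, which suggests that a measurement whose MFA raises is filtered out of the set that gets polled. The sketch below is a simplified model of that behavior in plain Elixir, not the `telemetry_poller` source. `PollerModel` and `FlakyMetric` are hypothetical names introduced only for illustration.

```elixir
defmodule PollerModel do
  # Hypothetical, simplified model of a poller that drops any measurement
  # whose MFA raises. This is NOT the telemetry_poller implementation.

  # Runs one poll cycle and returns only the measurements that did not raise.
  def poll_once(measurements) do
    Enum.filter(measurements, fn {mod, fun, args} ->
      try do
        apply(mod, fun, args)
        true
      rescue
        _error -> false
      end
    end)
  end

  # Runs `cycles` poll cycles, carrying the surviving measurements forward.
  def run(measurements, cycles) do
    Enum.reduce(1..cycles, measurements, fn _cycle, remaining ->
      poll_once(remaining)
    end)
  end
end

defmodule FlakyMetric do
  # Hypothetical measurement: succeeds five times, raises on the sixth call.
  def execute(counter) do
    :counters.add(counter, 1, 1)
    count = :counters.get(counter, 1)
    if count == 6, do: raise("Something is not working correctly")
    :ok
  end
end

counter = :counters.new(1, [])
remaining = PollerModel.run([{FlakyMetric, :execute, [counter]}], 10)

# The measurement was dropped at the sixth cycle and never called again.
IO.inspect(remaining, label: "measurements still polled")
IO.inspect(:counters.get(counter, 1), label: "total calls")
```

In this model the function is called six times across ten cycles, which matches the symptom described in the steps above: one failure silences the metric for good.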

**Expected behavior**
Even if the metric function raises an error during a particular poll, polling should continue so that future values of the metric can be collected once the error (hopefully) goes away.
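
Until the library handles this, one possible workaround is to make sure the polled function never raises. The sketch below is only an illustration under that assumption: `SafeMeasurement` is a hypothetical helper, not part of PromEx or `telemetry_poller`, and the body of a plugin function such as `execute_ping_metrics` would be passed in as the anonymous function.

```elixir
defmodule SafeMeasurement do
  # Hypothetical helper: runs `fun` and converts any raised exception into
  # an `:error` return value, so the caller always returns normally.
  def run(fun) when is_function(fun, 0) do
    fun.()
  rescue
    error ->
      IO.puts(:stderr, "measurement failed: " <> Exception.message(error))
      :error
  end
end

# A successful call returns whatever the wrapped function returns.
ok_result = SafeMeasurement.run(fn -> :ok end)

# A failing call is logged to stderr and reported as :error instead of raising.
failed_result = SafeMeasurement.run(fn -> raise "can't return the metrics right now" end)

IO.inspect({ok_result, failed_result})
```

The trade-off is that failures have to be surfaced some other way (a log line here), since the poller no longer sees them. Note that `rescue` only covers raised exceptions, not exits or throws.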

**Environment**

Erlang/OTP 26 [erts-14.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Elixir 1.16.0 (compiled with Erlang/OTP 26)



**Additional context**
First raised on Slack.

akoutmos commented 1 month ago

Thanks for the detailed issue! I should be able to knock this out over the weekend. I'm currently working on another open source library, so I should be able to tackle this as well given I am in open source mode 😄