deadtrickster / prometheus.erl

Prometheus.io client in Erlang
MIT License
341 stars 117 forks source link

Crashes with quantile_summary metrics and multiple schedulers #159

Open mikpe opened 11 months ago

mikpe commented 11 months ago

We're getting persistent crashes when emitting quantile_summary metrics collected from multiple schedulers:

** exception error: no match of right hand side value true
     in function  quantile_estimator:merge/2 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 141)
     in call from quantile_estimator:compress/4 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 133)
     in call from quantile_estimator:compress/4 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 133)
     in call from quantile_estimator:compress/1 (/tmp/prometheus.erl/_build/default/lib/quantile_estimator/src/quantile_estimator.erl, line 106)
     in call from prometheus_quantile_summary:'-values/2-fun-0-'/3 (/tmp/prometheus.erl/src/metrics/prometheus_quantile_summary.erl, line 309)
     in call from lists:foldl_1/3 (lists.erl, line 1355)
     in call from prometheus_quantile_summary:values/2 (/tmp/prometheus.erl/src/metrics/prometheus_quantile_summary.erl, line 306)

There appears to be two contributing causes:

  1. quantile_estimator:compress/1 and its helper function merge/2 want to assert that the input is as expected (i.e., as quantile_estimator itself would have produced), but the assertion on line 141 fails.
  2. prometheus_quantile_summary records observations from different schedulers under different keys in its ETS table (to reduce contention). Then it wants to combine observations with quantile_merge/2, which just appends the observations from the different schedulers (the ++ on line 497), before passing that to quantile_estimator:compress/1.

The problem, as far as I understand this code, is that the append in quantile_merge/2 is not guaranteed to result in data that quantile_estimator:merge/2 considers to be well-formed.

I'm attaching a test case (zipped due to github limitations). It's based on the actual contents of the prometheus_quantile_summary_table ETS from one of our crashes, reduced to the bare minimum that still reproduced the crash. (I've tried to come up with a test case that only used the public APIs, but failed, possibly due to the non-determinism described below.) bug.erl.zip

Note: quantile_summary_metrics.erl contains two non-deterministic constructs (retrieving multiple entries from a set ETS table) and I found it necessary to eliminate that non-determinism in order to have a reliable test case:

diff --git a/src/metrics/prometheus_quantile_summary.erl b/src/metrics/prometheus_quantile_summary.erl
index 9ffdd3e..e860bb0 100644
--- a/src/metrics/prometheus_quantile_summary.erl
+++ b/src/metrics/prometheus_quantile_summary.erl
@@ -290,7 +290,7 @@ value(Registry, Name, LabelValues) ->
                             [],
                             ['$$']}]) of
     [] -> undefined;
-    Values -> {Count, Sum, QE} = reduce_values(Values),
+    Values -> {Count, Sum, QE} = reduce_values(lists:sort(Values)),
               {Count,  prometheus_time:maybe_convert_to_du(DU, Sum), quantile_values(QE, QNs)}
   end.

@@ -403,7 +403,7 @@ insert_metric(Registry, Name, LabelValues, Value, ConflictCB) ->
   end.

 load_all_values(Registry, Name) ->
-  ets:match(?TABLE, {{Registry, Name, '$1', '_'}, '$2', '$3', '$4', '_'}).
+  lists:sort(ets:match(?TABLE, {{Registry, Name, '$1', '_'}, '$2', '$3', '$4', '_'})).

 get_configuration(Registry, Name) ->
   MF = prometheus_metric:check_mf_exists(?TABLE, Registry, Name),

0001-eliminate-non-deterministic-behaviour.patch.txt

This is with prometheus.erl v4.10.0, any recent OTP (24.3.4.14, 25.3.2.7, 26.1.2), and Linux/x86_64 (AL2, Fedora 38).

mikpe commented 11 months ago

The crash reported in https://github.com/deadtrickster/prometheus.erl/issues/146 looks surprisingly similar to the one we get.

chrzaszcz commented 7 months ago

I can confirm the issue. The merging logic is flawed, because, just as @mikpe suggested, concatenating data from two quantile summaries does not result in a valid quantile summary because of repeated ranks, wrong deltas, etc. There is also an issue that the ETS operations are not atomic, and with more than ?LENGTH (by default 16) schedulers there might be race conditions.

As a result prometheus_quantile_summary is currently unusable if there is more than one scheduler. A way to fix this might be to:

  1. Either implement proper summary merging logic, or just insert everything row by row on merge (starting with the largest summary).
  2. Also make sure that ?LENGTH is not smaller than erlang:system_info(schedulers_online).
lhoguin commented 7 months ago

Please open a PR with a test and fix and I will see if I can get Ilya or someone else to review and merge.

mikpe commented 7 months ago

If I had a fix I would have submitted it by now.

Our workaround was to ban any usage of prometheus.erl's quantile_summary metrics. For now we use histograms where we need that sort of thing, but they're not ideal. We have an internal ticket to reimplement quantile summaries from scratch, if and when we have to have them, but that hasn't happened yet.