Open mikpe opened 11 months ago
The crash reported in https://github.com/deadtrickster/prometheus.erl/issues/146 looks surprisingly similar to the one we get.
I can confirm the issue. The merging logic is flawed because, just as @mikpe suggested, concatenating data from two quantile summaries does not produce a valid quantile summary (repeated ranks, wrong deltas, etc.). There is also the issue that the ETS operations are not atomic, and with more than ?LENGTH (by default 16) schedulers there might be race conditions.
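To illustrate why plain concatenation breaks the summary invariants, here is a minimal sketch. It assumes a simplified, hypothetical entry shape of {Value, Rank}; the real quantile_estimator records are not modelled, and the module and function names are made up for this example.

```erlang
-module(append_demo).
-export([values_sorted/1, ranks_unique/1, demo/0]).

%% Hypothetical entry shape: {Value, Rank}. This is an assumption made
%% for illustration; it is not the record used by quantile_estimator.

%% True when the values are in non-decreasing order.
values_sorted([{V1, _}, {V2, _} = Next | Rest]) when V1 =< V2 ->
    values_sorted([Next | Rest]);
values_sorted([_, _ | _]) ->
    false;
values_sorted(_) ->
    true.

%% True when no rank occurs more than once.
ranks_unique(Entries) ->
    Ranks = [R || {_, R} <- Entries],
    length(Ranks) =:= length(lists:usort(Ranks)).

%% Two summaries that are each well-formed on their own, appended
%% the way two per-scheduler lists would be joined with ++.
demo() ->
    S0 = [{1, 1}, {5, 2}, {9, 3}],
    S1 = [{2, 1}, {3, 2}, {7, 3}],
    Appended = S0 ++ S1,
    {values_sorted(Appended), ranks_unique(Appended)}.
```

Each input list passes both checks by itself, while the appended list fails both, which is the kind of input an assertion in a compress step would be expected to reject.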
As a result prometheus_quantile_summary is currently unusable if there is more than one scheduler. A way to fix this might be to:
?LENGTH is not smaller than erlang:system_info(schedulers_online).

Please open a PR with a test and fix and I will see if I can get Ilya or someone else to review and merge.
If I had a fix I would have submitted it by now.
Our workaround was to ban any usage of prometheus.erl's quantile_summary metrics. For now we use histograms where we need that sort of thing, but they're not ideal. We have an internal ticket to reimplement quantile summaries from scratch, if and when we have to have them, but that hasn't happened yet.
We're getting persistent crashes when emitting quantile_summary metrics collected from multiple schedulers.

There appear to be two contributing causes:

1. quantile_estimator:compress/1 and its helper function merge/2 want to assert that the input is as expected (i.e., as quantile_estimator itself would have produced), but the assertion on line 141 fails.
2. prometheus_quantile_summary records observations from different schedulers under different keys in its ETS table (to reduce contention). Then it wants to combine observations with quantile_merge/2, which just appends the observations from the different schedulers (the ++ on line 497), before passing that to quantile_estimator:compress/1.

The problem, as far as I understand this code, is that the append in quantile_merge/2 is not guaranteed to result in data that quantile_estimator:merge/2 considers to be well-formed.

I'm attaching a test case (zipped due to github limitations). It's based on the actual contents of the prometheus_quantile_summary_table ETS from one of our crashes, reduced to the bare minimum that still reproduced the crash. (I've tried to come up with a test case that only used the public APIs, but failed, possibly due to the non-determinism described below.) bug.erl.zip

Note: quantile_summary_metrics.erl contains two non-deterministic constructs (retrieving multiple entries from a set ETS table) and I found it necessary to eliminate that non-determinism in order to have a reliable test case: 0001-eliminate-non-deterministic-behaviour.patch.txt
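The non-determinism in the note comes from the fact that the order in which a set ETS table returns multiple entries is unspecified. A minimal sketch of one way to get a stable order is to sort whatever the table returns before using it. The module name, table name, and {slot, N} keys below are made up for illustration, and this is not the attached patch.

```erlang
-module(ets_order_demo).
-export([demo/0]).

%% Fills a private set table with four hypothetical per-slot entries,
%% reads them back in whatever order ETS chooses, and then sorts them
%% so that later processing sees the same order on every run.
demo() ->
    T = ets:new(demo_table, [set, private]),
    true = ets:insert(T, [{{slot, N}, N * 10} || N <- lists:seq(0, 3)]),
    Unordered = ets:tab2list(T),      % order is unspecified for a set table
    Ordered = lists:sort(Unordered),  % standard term order, stable across runs
    true = ets:delete(T),
    {length(Unordered), Ordered}.
```

Sorting only makes the order reproducible for testing; it does not by itself make appended per-scheduler data a valid quantile summary.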
This is with prometheus.erl v4.10.0, any recent OTP (24.3.4.14, 25.3.2.7, 26.1.2), and Linux/x86_64 (AL2, Fedora 38).