mozai opened 5 years ago
A coworker just told me to try cbstats
and this is what I see:
/opt/couchbase/bin/cbstats localhost all -b iris_groups -u prometheus -p $trustno1 |grep oom
ep_oom_errors: 0
ep_tmp_oom_errors: 0
ep_warmup_oom: 0
and when I ask the exporter on the same machine
wget -qO- localhost:19191/metrics |grep cb_bucket_ep_oom_errors
# HELP cb_bucket_ep_oom_errors Number of times unrecoverable OOMs happened while processing operations
# TYPE cb_bucket_ep_oom_errors gauge
cb_bucket_ep_oom_errors{bucket="default"} 0
cb_bucket_ep_oom_errors{bucket="pupil"} 0
cb_bucket_ep_oom_errors{bucket="sclera"} 0
cb_bucket_ep_oom_errors{bucket="iris_events"} 0
cb_bucket_ep_oom_errors{bucket="iris_groups"} 4.456261e+06
I really think the metric the exporter gets is a counter, and the cbstats
tool is somehow displaying the increase instead of the absolute value.
Hello @mozai,
Thanks a lot for reporting this. I'll investigate it asap.
To be sure that Couchbase API returns the correct values, could you please run the following command:
curl http://<user>:<password>@<couchbase_hostname>/pools/default/buckets/iris_groups/stats | jq .op.samples.ep_oom_errors
It should return an array of the last values for ep_oom_errors
DOCSCRATCH=$(mktemp ./tmp.XXXXXX.json)
curl -gs 'http://prometheus:${redacted}@localhost:8091/pools/default/buckets/iris_groups/stats' >$DOCSCRATCH
jq -c <$DOCSCRATCH .op.samples.ep_oom_errors
[4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261]
I get an array of 60 identical numbers. I hope this doesn't mean I have four million errors per second persistently over the last minute.
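One way to sanity-check that reading: if the samples are cumulative since boot, the difference between the newest and oldest sample is the number of errors that actually occurred inside the sampled window. A minimal sketch using a stand-in file (the filename and values here are illustrative, not the live response):

```shell
# Simulate a saved /stats response with identical cumulative samples.
echo '{"op":{"samples":{"ep_oom_errors":[4456261,4456261,4456261]}}}' > /tmp/stats.json

# If the samples are cumulative, max - min = new errors inside the window.
jq '.op.samples.ep_oom_errors | max - min' /tmp/stats.json   # prints 0
```

Sixty identical samples would give 0 here, i.e. no new errors during the last minute, only a large total carried over from the past.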
Aie, I should've mentioned: I'm using the community edition of couchbase, v5.1.1
After some digging, the metric indicates that 4,456,261 OOM errors were reported since boot time. So it is sort of a counter metric but I can't use it as such because it can be reinitialized to zero when the machine reboots.
As for the output of the cbstats
command, I don't really know what it means... No OOM errors at the time of the execution of the command?
And to answer some of your earlier questions:
ep_oom_errors returns an array. The array contains, by default, values from the last 60 seconds. And yes: 4,456,261 errors were reported since boot time.
Then it's a counter, it's counting how many events happened, not how many are here right now. It's like counting how many cars passed a lamppost since you arrived instead of counting how many cars are under the lamppost when asked. If it were a gauge, then it would be like saying "there are four hundred cars under this lamppost" when there isn't a single car in sight, because you saw four hundred cars since you arrived earlier that day.
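The lamppost analogy can be made concrete with a tiny sketch (illustrative only; the event stream is made up):

```shell
# The same event stream, read as a counter vs as a gauge.
# Each number is "cars under the lamppost" during one observation.
events="1 1 0 1"
counter=0
for e in $events; do
  counter=$((counter + e))   # counter: every car ever seen, accumulated
  gauge=$e                   # gauge: only what is visible right now
done
echo "counter=$counter gauge=$gauge"   # counter=3 gauge=1
```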
From https://prometheus.io/docs/concepts/metric_types/: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
I can't use it as such because it can be reinitialized when the machine reboots.
Every counter is like that. node_exporter does not save the current value of node_network_transmit_bytes when it senses a machine is about to reboot, and SNMP daemons do not keep a state file for the current values of their OIDs. I understand your concern that a prometheus rate() function might assume the counter overflowed if the daemon resets it to zero, but I don't know how prometheus avoids making the wrong assumption when other counters are reset by a machine rebooting. I would cautiously test it.
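For what it's worth, rate()-style functions are generally described as treating any decrease between consecutive samples of a counter as a reset, so the post-reset value is counted in full instead of producing a huge negative delta. A minimal sketch of that logic (illustrative awk, not Prometheus's actual implementation):

```shell
# Samples of a cumulative counter that resets (e.g. the daemon restarts):
# 100 -> 150 -> 5 -> 30.  A naive delta across the reset would be -145.
printf '100\n150\n5\n30\n' | awk '
  NR > 1 { d = $1 - prev; if (d < 0) d = $1; total += d }  # decrease => reset
  { prev = $1 }
  END { print total }'   # prints 80 (50 + 5 + 25)
```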
In the meantime, I can't use this metric as a gauge metric because my couchbase machines do not currently have four million out-of-memory errors every time prometheus asks them. There may have been errors since I turned the machine on, but it doesn't have errors now, so cb_bucket_ep_oom_errors > 0
as an alert is a false positive, and graphing cb_bucket_ep_oom_errors
over time is misleading at best.
Hello @mozai, that makes sense. I'll work on it.
Hello @mozai,
Sorry it took so long.
I pushed a work in progress to the develop branch. You can now define the metric type in the metrics JSON files. I made some metrics counters by default (like cb_bucket_ep_oom_errors).
I should be able to test it soon in a real large cluster
I have 558 couchbase nodes, and over the past week I've only ever seen these metrics increase and plateau, never decrease, even on machines that presently have 85% RAM free. The couchbase_exporter documents them as
I found DataDog documents these metrics as type 'gauge', but I worry they are actually of type counter, because the numbers stay high long after the machine's memory pressure is relieved. I'm having a terrible time finding any mention of "Samples.EpOomErrors" or "Samples.EpTmpOomErrors" in the Couchbase documentation. All I've found is a passing mention of "ep_oom_errors" and how it's a bad thing if you see it at all... and about a dozen other websites that copy-paste that one paragraph.
I'm certain the exporter is correctly relaying the information from couchbase, but I would like an assurance that these metrics are of type gauge, and if so, a more comprehensible description of their meaning. I.e. if these are a gauge measurement of failed operations: how many operations were sampled for this gauge? A minute's worth? An hour's worth? If I scrape less often than the sample range, could errors go undetected between scrapes?
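To make that last worry concrete, here is the arithmetic (assuming, purely for illustration, that the gauge summarizes a 60-second sample window): any scrape interval longer than the window leaves part of the timeline unobserved, and errors that occur entirely inside the gap would never be scraped.

```shell
window=60    # assumed length of the sample window, in seconds (illustrative)
scrape=120   # a scrape interval longer than the window
echo "$((scrape - window))s of every ${scrape}s go unobserved"
```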