mozai opened 5 years ago
A coworker just told me to try cbstats
and this is what I see:
/opt/couchbase/bin/cbstats localhost all -b iris_groups -u prometheus -p $trustno1 |grep oom
ep_oom_errors: 0
ep_tmp_oom_errors: 0
ep_warmup_oom: 0
and when I ask the exporter on the same machine
wget -qO- localhost:19191/metrics |grep cb_bucket_ep_oom_errors
# HELP cb_bucket_ep_oom_errors Number of times unrecoverable OOMs happened while processing operations
# TYPE cb_bucket_ep_oom_errors gauge
cb_bucket_ep_oom_errors{bucket="default"} 0
cb_bucket_ep_oom_errors{bucket="pupil"} 0
cb_bucket_ep_oom_errors{bucket="sclera"} 0
cb_bucket_ep_oom_errors{bucket="iris_events"} 0
cb_bucket_ep_oom_errors{bucket="iris_groups"} 4.456261e+06
I really think the metric the exporter gets is a counter, and the cbstats
tool is somehow displaying the increase instead of the absolute value.
Hello @mozai,
Thanks a lot for reporting this. I'll investigate it asap.
To be sure that Couchbase API returns the correct values, could you please run the following command:
curl http://<user>:<password>@<couchbase_hostname>/pools/default/buckets/iris_groups/stats | jq .op.samples.ep_oom_errors
It should return an array of the last values for ep_oom_errors
DOCSCRATCH=$(mktemp ./tmp.XXXXXX.json)
curl -gs 'http://prometheus:${redacted}@localhost:8091/pools/default/buckets/iris_groups/stats' >$DOCSCRATCH
jq -c <$DOCSCRATCH .op.samples.ep_oom_errors
[4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,4456261,
4456261,4456261,4456261,4456261,4456261,4456261]
I get an array of 60 identical numbers. I hope this doesn't mean I have four million errors per second persistently over the last minute.
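One way to sanity-check that reading: if the samples are cumulative since boot, the difference between the newest and oldest sample is the number of errors that actually occurred inside the sampled window. A minimal sketch using a stand-in file (the filename and values here are illustrative, not the live response):

```shell
# Simulate a saved /stats response with identical cumulative samples.
echo '{"op":{"samples":{"ep_oom_errors":[4456261,4456261,4456261]}}}' > /tmp/stats.json

# If the samples are cumulative, max - min = new errors inside the window.
jq '.op.samples.ep_oom_errors | max - min' /tmp/stats.json   # prints 0
```

Sixty identical samples would give 0 here, i.e. no new errors during the last minute, only a large total carried over from the past.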
Aie, I should've mentioned: I'm using the community edition of couchbase, v5.1.1
After some digging, the metric indicates that 4,456,261 OOM errors were reported since boot time. So it is sort of a counter metric but I can't use it as such because it can be reinitialized to zero when the machine reboots.
As for the output of the cbstats
command, I don't really know what it means... No OOM errors at the time of the execution of the command?
And to answer some of your earlier questions:
ep_oom_errors returns an array. The array contains, by default, values from the last 60 seconds. And yes: 4,456,261 errors were reported since boot time.
Then it's a counter, it's counting how many events happened, not how many are here right now. It's like counting how many cars passed a lamppost since you arrived instead of counting how many cars are under the lamppost when asked. If it were a gauge, then it would be like saying "there are four hundred cars under this lamppost" when there isn't a single car in sight, because you saw four hundred cars since you arrived earlier that day.
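The lamppost analogy can be made concrete with a tiny sketch (illustrative only; the event stream is made up):

```shell
# The same event stream, read as a counter vs as a gauge.
# Each number is "cars under the lamppost" during one observation.
events="1 1 0 1"
counter=0
for e in $events; do
  counter=$((counter + e))   # counter: every car ever seen, accumulated
  gauge=$e                   # gauge: only what is visible right now
done
echo "counter=$counter gauge=$gauge"   # counter=3 gauge=1
```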
From https://prometheus.io/docs/concepts/metric_types/: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
I can't use it as such because it can be reinitialized when the machine reboots.
Every counter is like that. node_exporter does not save the current value of node_network_transmit_bytes when it senses a machine is about to reboot, and SNMP daemons do not keep a state file for the current values of their OIDs. I understand your concern that a prometheus rate() function might assume the counter overflowed if the daemon resets it to zero, but I don't know how prometheus avoids making the wrong assumption when other counters are reset by a machine rebooting. I would cautiously test it.
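For what it's worth, rate()-style functions are generally described as treating any decrease between consecutive samples of a counter as a reset, so the post-reset value is counted in full instead of producing a huge negative delta. A minimal sketch of that logic (illustrative awk, not Prometheus's actual implementation):

```shell
# Samples of a cumulative counter that resets (e.g. the daemon restarts):
# 100 -> 150 -> 5 -> 30.  A naive delta across the reset would be -145.
printf '100\n150\n5\n30\n' | awk '
  NR > 1 { d = $1 - prev; if (d < 0) d = $1; total += d }  # decrease => reset
  { prev = $1 }
  END { print total }'   # prints 80 (50 + 5 + 25)
```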
In the meantime, I can't use this metric as a gauge metric because my couchbase machines do not currently have four million out-of-memory errors every time prometheus asks them. There may have been errors since I turned the machine on, but it doesn't have errors now, so cb_bucket_ep_oom_errors > 0
as an alert is a false positive, and graphing cb_bucket_ep_oom_errors
over time is misleading at best.
Hello @mozai, that makes sense. I'll work on it.
Hello @mozai,
Sorry it took so long.
I pushed a work in progress to the develop branch. You can now define the metric type in the metrics JSON files. I made some metrics counters by default (like cb_bucket_ep_oom_errors).
I should be able to test it soon in a real large cluster
I have 558 couchbase nodes, and over the past week I've only ever seen these metrics increase and plateau, never decrease, even on machines that presently have 85% RAM free. The couchbase_exporter documents them as
I found DataDog documents these metrics as type 'gauge', but I worry they are actually of type counter, because the numbers stay high long after the machine's memory pressure is relieved. I'm having a terrible time finding any mention of "Samples.EpOomErrors" or "Samples.EpTmpOomErrors" in the Couchbase documentation. All I've found is a passing mention of "ep_oom_errors" and how it's a bad thing if you see it at all... and about a dozen other websites that copy-paste that one paragraph.
I'm certain the exporter is correctly relaying the information from couchbase, but I would like an assurance that these metrics are of type gauge, and if so, a more comprehensible description of their meaning. I.e. if these are a gauge measurement of failed operations: how many operations were sampled for this gauge? A minute's worth? An hour's worth? If I scrape less often than the sample range, could errors go undetected between scrapes?
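To make that last worry concrete, here is the arithmetic (assuming, purely for illustration, that the gauge summarizes a 60-second sample window): any scrape interval longer than the window leaves part of the timeline unobserved, and errors that occur entirely inside the gap would never be scraped.

```shell
window=60    # assumed length of the sample window, in seconds (illustrative)
scrape=120   # a scrape interval longer than the window
echo "$((scrape - window))s of every ${scrape}s go unobserved"
```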