Open hoel-zr-o opened 2 weeks ago
Thank you for the thorough analysis. We take `log2l` from bcc-tools:
I think it would make sense to fix the issue there and then backport it, unless you think it's somehow specific to ebpf_exporter.
Well, that's the thing - I don't know that I'd say `log2l` itself is buggy, per se. It's operating on integers, so truncating towards zero might not be the wrong behavior there - e.g. `3 / 2 == 1` isn't surprising, so I don't believe `log2l(3) == 1` is either. I might be splitting hairs, though, since buggy or not, the BCC project also uses `log2l` this way, so they would also need to adjust `log2l` or its callers.
I personally think the right move would be to adjust the two `log2l` call sites in `maps.bpf.h` to add 1 to `key.bucket` if `log2(increment)` would normally have a fractional component (which `increment_exp2zero_histogram` seems to partially do), but I defer to your judgement here!
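To make the suggestion concrete, here is a minimal userspace sketch of "add 1 when the logarithm would have a fractional component". The helper names `floor_log2_u64` and `ceil_log2_u64` are hypothetical, the truncating helper is only a stand-in for the behavior described in this thread (it is not the BCC or `ebpf_exporter` code), and the loop is written for readability rather than for the BPF verifier.

```c
#include <stdint.h>

/* Stand-in for a truncating integer log2 (hypothetical, not the BCC
 * implementation). Returns 0 for v == 0 and v == 1. */
static uint64_t floor_log2_u64(uint64_t v)
{
    uint64_t r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Hypothetical helper: smallest k such that v <= 2^k, i.e. the
 * truncated log2 plus 1 whenever v is not an exact power of two. */
static uint64_t ceil_log2_u64(uint64_t v)
{
    uint64_t k = floor_log2_u64(v);

    /* For v > 0, v & (v - 1) is nonzero exactly when v is not a
     * power of two, which is when log2(v) has a fractional part. */
    if (v != 0 && (v & (v - 1)) != 0)
        k++;
    return k;
}
```

With this sketch, `ceil_log2_u64(3)` is 2 and `ceil_log2_u64(4)` stays 2, so exact powers of two keep their current bucket and everything in between moves up by one.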
Hello!
I think I may have found an off-by-one error in the various histogram macros provided by `maps.bpf.h` - the tl;dr is that when I record a value in an `exp2` histogram, I see that value reflected in the bucket below or equal to that value, rather than the first bucket at or above that value.
For example, let's say I have buckets for 1, 2, 4, 8, ..., 1024, and I record the value 268. I would expect the buckets corresponding to `le="5.12e-07"` and `le="1.024e-06"` to have their values incremented by one - but I am also seeing the bucket for `le="2.56e-07"` get incremented.
Minimal Example
Here is a fairly minimal example that demonstrates this by just inserting the values 1-8 into an `exp2` histogram (I used `tp/sched/sched_process_exec` as a proxy for an event that will happen almost immediately upon startup - my BPF knowledge is mostly limited to `bpftrace` and I didn't see an analogue to its `BEGIN` block that I could use):
...and the output of doing a `curl -s http://localhost:9435/metrics | grep ebpf_exporter_poc_values_bucket` against `sudo ./ebpf_exporter --config.dir=examples --config.names=off_by_one`:
By contrast, here's a Go program that creates a Prometheus histogram and observes values 1-8:
...and the resulting output:
Suspected Cause
I think what's happening here is that `ebpf_exporter` is using `log2l` to determine the bucket index, and since it's performing integer arithmetic, `log2l` is truncating the result toward zero, so `log(3) / log(2) = 1.5849625007211563` ends up as `1`, but it should be `2`.
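As a rough illustration of the suspected cause, the sketch below compares two bucket indices for the same value: the index implied by `le` semantics (first upper bound `2^k` with `v <= 2^k`) and a truncating integer log2. All three function names are hypothetical, and the truncating helper is an assumption about the behavior described above, not the actual `log2l` source.

```c
#include <stdint.h>

/* Index of the first bucket whose upper bound 2^k satisfies v <= 2^k,
 * which is what cumulative "le" buckets imply.
 * Assumes v <= 2^63 so the shift stays in range. */
static unsigned le_bucket_index(uint64_t v)
{
    unsigned k = 0;

    while (((uint64_t)1 << k) < v)
        k++;
    return k;
}

/* Truncating integer log2, standing in for the behavior described
 * in this issue (hypothetical helper). */
static unsigned truncated_log2(uint64_t v)
{
    unsigned k = 0;

    while (v > 1) {
        v >>= 1;
        k++;
    }
    return k;
}

/* Number of values in [lo, hi] for which the two indices disagree.
 * Assumes hi < UINT64_MAX so the loop counter cannot wrap. */
static unsigned count_mismatches(uint64_t lo, uint64_t hi)
{
    unsigned n = 0;

    for (uint64_t v = lo; v <= hi; v++)
        if (le_bucket_index(v) != truncated_log2(v))
            n++;
    return n;
}
```

For the values 1-8 the two indices agree on the exact powers of two (1, 2, 4, 8) and disagree on 3, 5, 6, and 7, where the truncated index is one lower. The same holds for 268: the truncated index is 8 (bound 256), while the `le` index is 9 (bound 512).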