`ValueAtPercentile()` 4.5X on-cpu time optimization: remove expensive condition checks and re-use computation on hotpaths

Summary of optimizations

Benchmark	Baseline ns/op	Comparison ns/op	% Improvement	Improvement factor
BenchmarkHistogramValueAtPercentile-8	31896	7569	76.27%	4.2
BenchmarkHistogramValueAtPercentileGivenPercentileSlice-8	159894	33900	78.80%	4.7

Detail of analysis/changes

Looking at the baseline CPU time by function in the following manner:

go test -bench=BenchmarkHistogramValueAtPercentile -test.cpuprofile=BenchmarkHistogramValueAtPercentile-Baseline.txt -test.benchtime=10s
pprof -web BenchmarkHistogramValueAtPercentile-Baseline.txt

We can observe that the iterator's nextCountAtIdx() is responsible for the majority of the CPU time (even after the percentile calculation improvements in the latest release, as showcased in #46).

Doing the same analysis by line of code, as follows:

go test -bench=BenchmarkHistogramValueAtPercentile -test.cpuprofile=BenchmarkHistogramValueAtPercentile-Baseline.txt -test.benchtime=10s
pprof -web -lines BenchmarkHistogramValueAtPercentile-Baseline.txt

The top CPU-consuming lines of code are:

Condition on hdr.go#L626 taking ~11% of CPU time: `if i.countToIdx >= i.h.totalCount`. We already perform a more restrictive check at hdr.go#L340 (`if total >= countAtPercentile {`), so this condition check can be dropped entirely.
Condition on hdr.go#L631 taking ~11% of CPU time: `if i.subBucketIdx >= i.h.subBucketCount {`.
Condition on hdr.go#L636 taking ~7% of CPU time: `if i.bucketIdx >= i.h.bucketCount {`. Since even at the maximum (percentile 100) we only reach the bucketCount limit, this duplicate check can also be dropped entirely.
Return of getCountAtIdx on hdr.go#L643 taking ~19% of CPU time: `return true`. Note that the function is not inlined. We can go from up to O(N+M) calls to getCountAtIdx to a single, O(1), call to a newly introduced optimized method named getValueFromIdxUpToCount.
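
As a rough illustration of why the iterator-side checks are redundant, here is a minimal, self-contained sketch. It is not the code in hdr.go: `toyHist`, `indexWithChecks` and `indexSinglePass` are hypothetical names, the counts are flattened into a single slice, and the histogram is assumed to be non-empty with 1 <= target <= totalCount. Under those assumptions the caller's `total >= target` check always fires first, so both functions return the same index.

```go
package main

import "fmt"

// toyHist is a hypothetical, flattened stand-in for the histogram:
// counts[i] is the number of samples recorded in slot i.
type toyHist struct {
	counts     []int64
	totalCount int64
}

// indexWithChecks mimics the baseline shape: every step re-checks the
// running count and the index before the caller applies its own,
// stricter, stop condition.
func indexWithChecks(h *toyHist, target int64) int {
	var countToIdx int64
	for i := 0; ; i++ {
		if countToIdx >= h.totalCount { // iterator-side check
			return len(h.counts) - 1
		}
		if i >= len(h.counts) { // iterator-side check
			return len(h.counts) - 1
		}
		countToIdx += h.counts[i]
		if countToIdx >= target { // caller-side check
			return i
		}
	}
}

// indexSinglePass keeps only the caller-side check. Because
// target <= totalCount, it always triggers no later than the
// iterator-side checks would have.
func indexSinglePass(h *toyHist, target int64) int {
	var total int64
	for i, c := range h.counts {
		total += c
		if total >= target {
			return i
		}
	}
	return len(h.counts) - 1
}

func main() {
	h := &toyHist{counts: []int64{0, 3, 0, 5, 2}, totalCount: 10}
	for _, target := range []int64{1, 4, 10} {
		fmt.Println(target, indexWithChecks(h, target), indexSinglePass(h, target))
	}
}
```

The single-pass version also avoids one non-inlined function call per slot, which is the second half of the saving described above.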

Looking further at the hotspots, getCountAtIndex is also a good candidate for optimization (the call on hdr.go#L640 takes 13% of CPU time).

Even though we can't remove this call, we can reduce the duplicate computation within it, specifically the repeated calculation of bucketBaseIdx, which does not change while we iterate over the sub-buckets of a given bucket. With that in mind, we've introduced getCountAtIndexGivenBucketBaseIdx and only calculate bucketBaseIdx on the iterations that change bucket (meaning no wasted computation on sub-bucket steps).
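
The hoisting idea can be sketched as follows. This is a simplified model rather than the actual implementation: the `layout` type, `newLayout`, `sumNaive` and `sumHoisted` are hypothetical, and the index formula is assumed to follow the usual HDR layout, where bucket 0 spans all sub-buckets and later buckets only use the upper half.

```go
package main

import "fmt"

// layout is a hypothetical, reduced version of the histogram's index layout.
type layout struct {
	counts                      []int64
	bucketCount                 int32
	subBucketCount              int32
	subBucketHalfCount          int32
	subBucketHalfCountMagnitude uint
}

// newLayout builds a toy layout with 2^(mag+1) sub-buckets per bucket.
func newLayout(bucketCount int32, mag uint) *layout {
	half := int32(1) << mag
	return &layout{
		counts:                      make([]int64, (bucketCount+1)*half),
		bucketCount:                 bucketCount,
		subBucketCount:              2 * half,
		subBucketHalfCount:          half,
		subBucketHalfCountMagnitude: mag,
	}
}

// countsIndex recomputes the bucket base on every call.
func (l *layout) countsIndex(bucketIdx, subBucketIdx int32) int32 {
	bucketBaseIdx := (bucketIdx + 1) << l.subBucketHalfCountMagnitude
	return bucketBaseIdx + subBucketIdx - l.subBucketHalfCount
}

// firstSubBucket returns where iteration starts inside a bucket.
func (l *layout) firstSubBucket(bucketIdx int32) int32 {
	if bucketIdx == 0 {
		return 0
	}
	return l.subBucketHalfCount
}

// sumNaive derives the bucket base once per sub-bucket.
func (l *layout) sumNaive() int64 {
	var total int64
	for b := int32(0); b < l.bucketCount; b++ {
		for s := l.firstSubBucket(b); s < l.subBucketCount; s++ {
			total += l.counts[l.countsIndex(b, s)]
		}
	}
	return total
}

// sumHoisted derives the bucket base once per bucket and reuses it
// for every sub-bucket in that bucket.
func (l *layout) sumHoisted() int64 {
	var total int64
	for b := int32(0); b < l.bucketCount; b++ {
		bucketBaseIdx := (b + 1) << l.subBucketHalfCountMagnitude
		for s := l.firstSubBucket(b); s < l.subBucketCount; s++ {
			total += l.counts[bucketBaseIdx+s-l.subBucketHalfCount]
		}
	}
	return total
}

func main() {
	l := newLayout(3, 2)
	for i := range l.counts {
		l.counts[i] = int64(i + 1)
	}
	fmt.Println(l.sumNaive(), l.sumHoisted())
}
```

Both loops visit the same slots in the same order; the hoisted version simply performs the shift once per bucket instead of once per sub-bucket.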

Impact of the above optimizations

With all of the above changes in place, we've moved from a baseline of:

(base) fco@fcos-Air hdrhistogram-go % go test -bench=BenchmarkHistogramValueAtPercentile -test.benchtime=10s
goos: darwin
goarch: arm64
pkg: github.com/HdrHistogram/hdrhistogram-go
BenchmarkHistogramValueAtPercentile-8                             352322             31896 ns/op               0 B/op          0 allocs/op
BenchmarkHistogramValueAtPercentileGivenPercentileSlice-8          83769            159894 ns/op               0 B/op          0 allocs/op
BenchmarkHistogramValueAtPercentilesGivenPercentileSlice-8        244653             53997 ns/op             248 B/op          4 allocs/op
PASS
ok      github.com/HdrHistogram/hdrhistogram-go 40.829s

to the new optimized ValueAtPercentile / ValueAtPercentileGivenPercentileSlice:

(base) fco@fcos-Air hdrhistogram-go % go test -bench=BenchmarkHistogramValueAtPercentile  -test.benchtime=10s 
goos: darwin
goarch: arm64
pkg: github.com/HdrHistogram/hdrhistogram-go
BenchmarkHistogramValueAtPercentile-8                            1671085              7569 ns/op               0 B/op          0 allocs/op
BenchmarkHistogramValueAtPercentileGivenPercentileSlice-8         335668             33900 ns/op               0 B/op          0 allocs/op
BenchmarkHistogramValueAtPercentilesGivenPercentileSlice-8        229405             51147 ns/op             248 B/op          4 allocs/op
PASS
ok      github.com/HdrHistogram/hdrhistogram-go 45.063s
