avito-tech / bioyino

High performance and high-precision multithreaded StatsD server
The Unlicense

Sampling rate support #53

Closed Karsonito closed 3 years ago

Karsonito commented 3 years ago

Hello! If I understand correctly, the sampling rate is not supported the way it is in other statsd implementations. Bioyino does not "restore" the original value in proportion to the sample rate. For example, after sending metric:1|c|@0.01 I expect to receive metric 100.

Albibek commented 3 years ago

Hi. The format is supported, but the value restoration is missing from the code. I think this happened because we are not using sampling in our systems.

It should still be considered a bug and has to be fixed.

Albibek commented 3 years ago

I've pushed the fix into the 0.8.0-alpha1 tag. It contains only the sampling fix applied on top of the latest stable v0.7.0, so the alpha should not scare you; the fix itself will land in 0.8.0 stable with other changes.

The current fix is quick and may bring some precision loss due to the historical use of float32 for sampling alongside float64 in all other calculations. This should not be an issue, because a much bigger loss is already assumed when you do the sampling itself. But anyway, a better and bigger change is going to be introduced in 0.8.0.

@Karsonito, it would be great if you tested the change against your requirements and in your environment.

Karsonito commented 3 years ago

Sorry for the delay. Thanks for the improvement! Everything works correctly. The only problem is https://github.com/avito-tech/bioyino/issues/57. Please merge the fix into the 0.8 branch.

Albibek commented 3 years ago

Pushed

Karsonito commented 3 years ago

It's a pity I didn't notice earlier that the timing aggregates are broken when rate < 1.

Compare results for statsd and bioyino:

for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 9235; done

statsd.metrics_received.count 10 1616174319
timer.count 100 1616174319
timer.sum 550 1616174319
timer.lower 10 1616174319
timer.mean 55 1616174319
timer.upper 100 1616174319
timer.upper_50 50 1616174319
timer.upper_95 100 1616174319
timer.upper_99 100 1616174319

for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 8825; done

bioyino.ingress-metric; 10 1616174410
timer;aggregate=count 10 1616174410
timer;aggregate=sum 5500 1616174410
timer;aggregate=min 100 1616174410
timer;aggregate=mean 550 1616174410
timer;aggregate=max 1000 1616174410
timer;aggregate=percentile.50 550 1616174410
timer;aggregate=percentile.95 955 1616174410
timer;aggregate=percentile.99 991 1616174410

None of the bioyino timing aggregates are correctly restored.

Albibek commented 3 years ago

Actually, statsd seems to be ignoring the rate totally in your case. All we do with the rate at the moment is divide the original value by it. This was copied from brubeck, but the original statsd docs don't say much about what the rate really means and how the value should be changed.
What behaviour do you actually expect here for rate > 1 and rate < 1? Does it have to depend on the metric type?

Karsonito commented 3 years ago

Thanks for attention.

"Actually, statsd seems to be ignoring the rate totally in your case." In which aggregate statsd ignoring rate? count is restored to 100 as it should Or you mean sum which also should be restored up to 5500 to be consistent with mean x count?

All other aggregates seems correct in my opinion

I tested brubeck and it gave the same result as statsd:

for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 8127; done

timer.count 100 1616192248
timer.min 10 1616192248
timer.max 100 1616192248
timer.sum 550 1616192248
timer.mean 55 1616192248
timer.median 50 1616192248
timer.percentile.75 80 1616192248
timer.percentile.95 100 1616192248
timer.percentile.98 100 1616192248
timer.percentile.99 100 1616192248
brubeck.metrics 10 1616192248

"What behaviour do you actually expect here for rate > 1 and < 1? Does it have to depend on metric type?" Rate > 1 is not useful. Brubeck ignores such metrics, though statsd uses them. If rate < 1 only counters and timers count should be restored for the sake of correctness. IMHO )

Albibek commented 3 years ago

This seems not totally correct to me, and I would prefer an intuitive and correct algorithm here rather than blindly pursuing compatibility.

Let me try to model a sampling example.

Imagine rate = 0.5 and the original series like this:

3 ? 1 ? 5 ? 2 ? 6

The ? are values we don't know because we are sampling. They could be estimated as something close to the known values, like an average between the two neighbours, or even easier, the previous value. So the estimated, "restored" series would look like this:

3 3 1 1 5 5 2 2 6

Now with this approach:

  1. For counter-typed metrics, the values should be scaled by 1/rate.
  2. For gauges and sets we will need another behaviour: ignore the sampling rate and only account for the count aggregate properly.
  3. For timer-typed metrics the correct aggregation is different and really more complicated. Our aggregation uses the full series for computing aggregates, without estimation in between. So, for e.g. percentiles to be counted correctly, we would have to insert 1/rate values into the accumulating vector for each incoming value. The size of such a vector therefore grows proportionally to 1/rate, so for rate 0.0001 the vector instantly becomes 10k values for a single incoming metric (a sketch of this naive approach follows the list).
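A minimal sketch of that naive approach (hypothetical names, not the actual bioyino code):

// Naive restoration for a sampled timer: push each incoming value
// 1/rate times into the accumulating vector, so the existing percentile
// math works unchanged. For rate = 0.0001 a single value adds 10_000 entries.
fn push_sampled(acc: &mut Vec<f64>, value: f64, rate: f64) {
    let copies = (1.0 / rate).round() as usize;
    acc.extend(std::iter::repeat(value).take(copies));
}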

And we also have to consider at least one edge case: metrics with different rates arriving at the same time. Nothing stops you from doing this:

timer:10|ms|@0.1
timer:10|ms
timer:10|ms|@0.02

It seems that the only real problem here is the situation with the timer storage. I think it should be solved by interpreting sampled timers differently from unsampled ones and by storing a special kind of vector processed by separate formulas on aggregation. Such vectors will obviously take more CPU during aggregation of such a metric. Not totally sure, but at first sight, for percentiles it will become O(N/2) instead of O(1) compared to unsampled timers.
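One possible shape for such a special vector, sketched under my own assumptions (this is not the actual implementation): keep each sampled value once, paired with its weight 1/rate, and let the percentile formula walk the weighted samples. Memory stays proportional to the received count, and the extra cost moves to aggregation time, which matches the O(N/2)-vs-O(1) estimate above.

// A sampled timer stored as (value, weight) pairs, weight = 1/rate.
struct WeightedSample {
    value: f64,
    weight: f64,
}

// Weighted percentile: sort by value, then walk until the cumulative weight
// crosses p * total_weight. This aggregation-time cost replaces the memory
// cost of materializing 1/rate copies of every value.
fn weighted_percentile(samples: &mut [WeightedSample], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.value.partial_cmp(&b.value).unwrap());
    let total: f64 = samples.iter().map(|s| s.weight).sum();
    let mut cumulative = 0.0;
    for s in samples.iter() {
        cumulative += s.weight;
        if cumulative >= p * total {
            return Some(s.value);
        }
    }
    samples.last().map(|s| s.value)
}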

This also means that sampled and unsampled timers will not be mixable inside one aggregation period. So for the example above only the first metric will be considered, and the others will be errors because of the sampling rate mismatch.

While it would take some time to program, I think this will give the correctness we want here. The price is that the results will probably not be the same as with the other statsd implementations.

What do you think?

Karsonito commented 3 years ago

"And I would prefer the intuitive and correct algorithm here rather than blindly pursuing the compatilbility." I agree with you. If we have the opportunity, it is desirable to restore the data according to the probability theory. And do not pay attention to the original and other statsd implementations.

timer:10|ms|@0.1
timer:10|ms
timer:10|ms|@0.02

I don't see practical sense in using multiple rates in one aggregation period. It's enough to use the values for only one rate (for example, the largest) and ignore the rest.

I don't know how you calculate all the aggregations. Do you have to restore all 100% of the values to make the calculations correct? Or is it possible to use only the sampled values and multiply count and sum by 1/rate?

Restoring all values is not friendly to memory and CPU, especially without a reasonable goal )

Albibek commented 3 years ago

What about the rate aggregate? I mean not the sample rate, but rate as in metrics per second.

For a series of 300 values sampled at sample_rate = 0.1 over 30 seconds, do you want to have metric1.rate = 10 or metric1.rate = 1?

Karsonito commented 3 years ago

rate = sum / interval = 300 / 30 = 10

Albibek commented 3 years ago

So you want the "system" rate, the one at which the metrics are received by bio, not the original one, i.e. the rate at the "source", which in my example would be sum / interval / sampling_rate = 300 / 30 / 10 = 1, right?

Karsonito commented 3 years ago

It's not a question of wish ) I think all the aggregates must be consistent with each other.

For example:
Real event count inside the app: 300
Sample rate: 0.1
Sent/received count: 30
Interval: 30 s
Aggregates: sum = 30 / 0.1 = 300, rate = sum / 30 = 10
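The same arithmetic as code, just to pin it down (a sketch with hypothetical names):

// 30 values arrived at sample_rate = 0.1 over a 30-second interval.
fn restored_aggregates(received: f64, sample_rate: f64, interval_secs: f64) -> (f64, f64) {
    let sum = received / sample_rate; // 30 / 0.1 = 300, the real event count
    let rate = sum / interval_secs;   // 300 / 30 = 10 events per second
    (sum, rate)
}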

And such logic of restoring original values should be used in all other cases.

The only exception is bioyino's inner stats. Those must expose counts without any restoring.

Do you agree with me?

Karsonito commented 3 years ago

So you want the "system" rate, the one at which the metrics are received by bio, not the original one, i.e. the rate at the "source", which in my example would be sum / interval / sampling_rate = 300 / 30 / 10 = 1, right?

We need the original rate, which is 10 in the current example. BTW, we can speak Russian to minimize misunderstanding )

Albibek commented 3 years ago

BTW, we can speak Russian to minimize misunderstanding )

Not here, please. I'd like the issues to be world-readable, just in case. Any other chat - Telegram or something - you're welcome.

We need the original rate, which is 10 in the current example.

I may have miscalculated this, but that is what I meant. You want the original rate inside the real application, not the rate at which bioyino receives the metrics. I've got it anyway; I thought the same thing here.

Karsonito commented 3 years ago

We need the original rate, which is 10 in the current example.

I may have miscalculated this, but that is what I meant. You want the original rate inside the real application, not the rate at which bioyino receives the metrics. I've got it anyway; I thought the same thing here.

Excellent! Thank you for your time! Waiting for the implementation.

Albibek commented 3 years ago

Implemented the correct sampling rate counting on the way to 0.8.0. Not sure it will go into 0.7, because of the lot of changes involved (storing the sampling rate in memory was removed in 0.7, but it turned out to be required again, so I had to bring it back with a non-trivial bunch of changes).

Good news: this tag currently runs well in our production, though so far only on cluster servers, not agents. It would be great if you tried the sampling rate counting together with the other features. Please make sure to read the CHANGELOG, since it has many changes incompatible with 0.7.

Karsonito commented 3 years ago

Thanks for your work! There are some problems with the new version:

1. Not all values are received if multimessage is disabled

Only a part of the values (from a third to a half) is handled and aggregated.

for i in {1..100}; do echo "test.counter:1|c" | nc -u 127.0.0.1 8825; done
bioyino.ingress-metric; 51 1617572688
test.counter; 51 1617572688

for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
bioyino.ingress-metric; 46 1617573028
test.timer;aggregate=percentile.999 99.955 1617573028
test.timer;aggregate=max 100 1617573028
test.timer;aggregate=percentile.75 78.75 1617573028
test.timer;aggregate=min 1 1617573028
test.timer;aggregate=percentile.98 99.1 1617573028
test.timer;aggregate=median 48.5 1617573028
test.timer;aggregate=count 46 1617573028
test.timer;aggregate=last 99 1617573028
test.timer;aggregate=sum 2432 1617573028
test.timer;aggregate=mean 52.869565217391308 1617573028
test.timer;aggregate=rate 9.2 1617573028
test.timer;aggregate=percentile.99 99.55 1617573028
test.timer;aggregate=percentile.95 97.75 1617573028

Config options:

multimessage = false
buffer-flush-time = 100
buffer-flush-length = 0

All other meaningful settings are left at the defaults from https://raw.githubusercontent.com/avito-tech/bioyino/v0.8.0/config.toml. See the attached bioyino-0.8.0.toml.txt.

I tried to push the stuck packets out of the buffer with a periodic metric every buffer-flush-time, but without success. I have never seen such behavior before.

2. For timers, the last aggregate is not correct and takes different values:

for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
test.timer;aggregate=count 100 1617573493
test.timer;aggregate=percentile.75 75.25 1617573493
test.timer;aggregate=percentile.95 95.05 1617573493
test.timer;aggregate=last 92 1617573493
test.timer;aggregate=rate 20 1617573493
test.timer;aggregate=percentile.999 99.901 1617573493
test.timer;aggregate=percentile.99 99.01 1617573493
test.timer;aggregate=sum 5050 1617573493
test.timer;aggregate=percentile.98 98.02 1617573493
test.timer;aggregate=median 50.5 1617573493
test.timer;aggregate=max 100 1617573493
test.timer;aggregate=mean 50.5 1617573493
test.timer;aggregate=min 1 1617573493
...
for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
test.timer;aggregate=last 91 1617573643

3. Restored values for counters and timers (count, sum and rate) have too high precision.

echo "test.counter:1|c|@0.1" | nc -u 127.0.0.1 8825
test.counter; 9.99999985098839 1617575824

echo "test.timer:1|ms|@0.1" | nc -u 127.0.0.1 8825
test.timer;aggregate=count 9.99999985098839 1617575969
test.timer;aggregate=rate 1.999999970197678 1617575969
test.timer;aggregate=sum 9.99999985098839 1617575969

Is it possible to decrease it for restored values? I think the extra digits don't carry any meaning, but they use more space in a TSDB with delta-encoding compression.

Albibek commented 3 years ago
  1. I'll look into it and will let you know later. As of now, just for the sake of testing, you can try a trick with "pushing" the rest of the metrics through with another set. For example, send 1000 more metrics with another name.
  2. The last aggregate incorrectness is a result of multithreaded aggregation. As with gauge, there is no way to determine which exact value came last, but it will most probably be one of the last ones. Also notice that even with single-threaded aggregation it is still not possible to determine this, because of, for example, network packet reordering.
  3. The result you see is not a precision problem; it is the result of counting with float64 values and a float32 sampling rate converted to float64. If you try to count this by hand in any language with floats, like Javascript or C, the result will be the same. For exactly the same metric values on input it will always be the same, because floats are always counted like this. I cannot actually see how a TSDB would suffer from this. They obviously store values as 64-bit or 32-bit, so one metric is always 8 or 4 bytes, not depending on the number of characters. Anyway, in the real world you almost always have slightly different results each aggregation period, so a TSDB would much more probably store the diff than just increase the RLE counter. (The sketch right below reproduces the effect.)
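The effect from item 3 is easy to reproduce in isolation (a standalone sketch, not bioyino code): the 32-bit constant 0.1 widened to 64 bits is not exactly 0.1, and dividing by it yields the number from the report.

fn main() {
    let rate_f32: f32 = 0.1;           // the sampling rate, historically an f32
    let rate_f64 = rate_f32 as f64;    // widening keeps the f32 error: 0.10000000149011612
    println!("{}", 1.0f64 / rate_f64); // prints ~9.99999985098839
}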
Karsonito commented 3 years ago

I'll look into it and will let you know later. As of now, just for the sake of testing, you can try a trick with "pushing" the rest of the metrics through with another set. For example, send 1000 more metrics with another name.

I tried pushing with another metric. It didn't help.

Albibek commented 3 years ago

@Karsonito The UDP problem seems to be fixed; it should work as intended. Please try the latest tag.

Karsonito commented 3 years ago

Hello! I finally tested the last fix. No errors found. Under heavy load (500k packets/s, 300k unique metrics) it also works correctly. Thank you for the great work!

Karsonito commented 3 years ago

I cannot actually see how a TSDB would suffer from this. They obviously store values as 64-bit or 32-bit, so one metric is always 8 or 4 bytes, not depending on the number of characters. Anyway, in the real world you almost always have slightly different results each aggregation period, so a TSDB would much more probably store the diff than just increase the RLE counter.

We are using the VictoriaMetrics TSDB, which uses a slightly modified Gorilla compression: https://faun.pub/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932

I analyzed the compression and found that every 3 extra digits of precision doubles the record size in VM. In monitoring this precision is useless; it's not a finance tool.


That's why it's important to have the ability to decrease the precision of restored values.
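For illustration, a hypothetical rounding helper (not an existing bioyino option) that trims a value to a given number of significant digits before serialization:

// Round to `digits` significant digits; e.g. 9.99999985098839 -> 10.0 at 6 digits.
fn round_sig(value: f64, digits: i32) -> f64 {
    if value == 0.0 || !value.is_finite() {
        return value;
    }
    let magnitude = value.abs().log10().floor() as i32;
    let factor = 10f64.powi(digits - 1 - magnitude);
    (value * factor).round() / factor
}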

Albibek commented 3 years ago

Actually, it still may be useful in some edge cases where the counting deals with very low values. Anyway, the problem above (with 32-bit sampling) does not bring precision loss and does not happen because of it.

What I can offer is using 32-bit floats instead of 64-bit. I've pushed a new tag where you can try them.

To enable this feature you need to use the f32 feature flag when compiling:

cargo build --release --features f32

We've never tried to run bioyino with f32's, so I cannot guarantee it will all be correct. It should help you with the precision, because f32 has only 6 meaningful digits instead of 15. Will be glad to hear how it went and whether it counts things correctly.
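The difference in meaningful digits is easy to see in isolation (a sketch independent of bioyino):

fn main() {
    println!("{}", 1.0f32 / 3.0); // 0.33333334         (f32: ~7 significant digits)
    println!("{}", 1.0f64 / 3.0); // 0.3333333333333333 (f64: ~16 significant digits)
}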

Karsonito commented 3 years ago

Thanks for the quick solution, but it's not solving the problem.

I'll create a new issue with a request for rounding options and close the current one. Do you agree?

Albibek commented 3 years ago

Are you sure you've compiled this with the f32 flag? https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fad3d9d5e6d24cc5596df441ff374bcf

Karsonito commented 3 years ago

Yes, f32 is enabled: cargo build --release --features f32 --verbose

I see --cfg 'feature="f32"' added to the rustc commands in the log.

I did all possible cleanups before the build (I'm a newbie in Rust):

rm -rf ~/.cargo
cargo clean
cargo update

It doesn't help:

test.timer;aggregate=count 3.3333332538604738

But it differs from alpha5 without f32:

test.timer;aggregate=count 3.3333332008785727
Albibek commented 3 years ago

I think I know where the problem is. I'll try to make it emit fewer digits correctly, and will let you know when it's ready.

Albibek commented 3 years ago

Actually, that was fast. I've changed the float serializer to a more correct one. Try the 0.8.0-alpha6 tag. Here is what I have for your input now:

test.timer;aggregate=min 1.0 1620928444
test.timer;aggregate=percentile.95 1.0 1620928444
test.timer;aggregate=percentile.98 1.0 1620928444
test.timer;aggregate=max 1.0 1620928444
test.timer;aggregate=percentile.999 1.0 1620928444
test.timer;aggregate=mean 1.0 1620928444
test.timer;aggregate=percentile.99 1.0 1620928444
test.timer;aggregate=last 1.0 1620928444
test.timer;aggregate=count 3.3333333 1620928444
test.timer;aggregate=median 1.0 1620928444
test.timer;aggregate=sum 3.3333333 1620928444
test.timer;aggregate=rate 0.6666666 1620928444
test.timer;aggregate=percentile.75 1.0 1620928444
Karsonito commented 3 years ago

Thanks. Now it's great.

Karsonito commented 3 years ago

Please merge the last fixes into the v0.8.0 branch.

Albibek commented 3 years ago

These tags were already in the v0.8.0 branch, so there is no need to.

Karsonito commented 3 years ago

Maybe I misunderstood, but I do not see the last commit in the v0.8.0 branch: https://github.com/avito-tech/bioyino/commits/v0.8.0

commit 7a22907461f65fa9c5f15afe2b610cc047d4e1fd
Author: Sergey Noskov <snoskov@avito.ru>
Date:   Thu May 13 20:54:53 2021 +0300

    Change serializer for floats to more correct one
Albibek commented 3 years ago

Did not push it somehow. Now it's there.

Karsonito commented 3 years ago

FYI, I tested resource consumption for 0.6.0, 0.7.2 and 0.8.0 with counters and timers. For timers, 0.8.0 has the lowest RSS memory peaks during aggregation (-50%) and spends less CPU time (-10-15%). For counters the results are more complicated, and there are no obvious advantages or issues. Thanks for the work one more time!

[attached charts: bioyino_usage_timers_0, bioyino_usage_timers_1, bioyino_usage_counters_0, bioyino_usage_counters_1]