Hi. The format is supported, but the value restoration is missing from the code. I think this happened because we are not using sampling in our own systems.
It should still be considered a bug and has to be fixed.
I've pushed the fix into the 0.8.0-alpha1 tag. It contains only the sampling fix applied on top of the latest stable v0.7.0, so the alpha should not scare you; the fix itself will land in 0.8.0 stable with other changes.
The current fix is quick and may bring some precision loss due to the historical usage of float32 for sampling along with float64 in all other calculations. This should not be an issue, because a much bigger loss is assumed when you do the sampling itself. But anyway, a better and bigger change is going to be introduced in 0.8.0.
@Karsonito, it would be great if you tested the change against your requirements and in your environment.
Sorry for the delay. Thanks for the improvement! Everything works correctly. The only problem is https://github.com/avito-tech/bioyino/issues/57 Please merge the fix into the 0.8 branch.
Pushed
It's a pity I didn't notice that timing aggregates are broken when rate < 1.
Compare results for statsd and bioyino:
for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 9235; done
statsd.metrics_received.count 10 1616174319
timer.count 100 1616174319
timer.sum 550 1616174319
timer.lower 10 1616174319
timer.mean 55 1616174319
timer.upper 100 1616174319
timer.upper_50 50 1616174319
timer.upper_95 100 1616174319
timer.upper_99 100 1616174319
for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 8825; done
bioyino.ingress-metric; 10 1616174410
timer;aggregate=count 10 1616174410
timer;aggregate=sum 5500 1616174410
timer;aggregate=min 100 1616174410
timer;aggregate=mean 550 1616174410
timer;aggregate=max 1000 1616174410
timer;aggregate=percentile.50 550 1616174410
timer;aggregate=percentile.95 955 1616174410
timer;aggregate=percentile.99 991 1616174410
None of the bioyino timing aggregates are restored correctly.
Actually, statsd seems to be ignoring the rate totally in your case. All we do with the rate at the moment is divide the original value by it. This was copied from brubeck, but the original statsd docs don't say much about what it really means and how the value should be changed.
What behaviour do you actually expect here for rate > 1 and < 1? Does it have to depend on metric type?
Thanks for your attention.
"Actually, statsd seems to be ignoring the rate totally in your case."
In which aggregate statsd ignoring rate? count
is restored to 100 as it should
Or you mean sum
which also should be restored up to 5500 to be consistent with mean
x count
?
All other aggregates seems correct in my opinion
I tested brubeck and it gave the same result as statsd:
for i in 10 20 30 40 50 60 70 80 90 100; do echo "timer:$i|ms|@0.1" | nc -u 127.0.0.1 8127; done
timer.count 100 1616192248
timer.min 10 1616192248
timer.max 100 1616192248
timer.sum 550 1616192248
timer.mean 55 1616192248
timer.median 50 1616192248
timer.percentile.75 80 1616192248
timer.percentile.95 100 1616192248
timer.percentile.98 100 1616192248
timer.percentile.99 100 1616192248
brubeck.metrics 10 1616192248
"What behaviour do you actually expect here for rate > 1 and < 1? Does it have to depend on metric type?"
Rate > 1 is not useful. Brubeck ignores such metrics, though statsd uses them.
If rate < 1, only counters and the timers' count should be restored, for the sake of correctness. IMHO )
This seems not totally correct to me, and I would prefer an intuitive and correct algorithm here rather than blindly pursuing compatibility.
Let me try to model a sampling example.
Imagine rate = 0.5 and the original series like this:
3 ? 1 ? 5 ? 2 ? 6
The ? are values we don't know because we are sampling. They could be estimated as something close to the known values, like an average between the two neighbours, or even simpler: the previous value. So the estimated, "restored" series would look like this:
3 3 1 1 5 5 2 2 6
Now with this approach the usual aggregates can be computed over the restored series.
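A minimal sketch of that estimation in Rust (an illustration only, not bioyino's actual code):

// Restore a sampled series by assuming each unknown value equals the
// previous known one, i.e. each received value stands for 1/rate values.
fn restore(sampled: &[f64], rate: f64) -> Vec<f64> {
    let copies = (1.0 / rate).round() as usize; // rate = 0.5 -> 2 copies each
    sampled
        .iter()
        .flat_map(|&v| std::iter::repeat(v).take(copies))
        .collect()
}

fn main() {
    let sampled = [3.0, 1.0, 5.0, 2.0, 6.0];
    let restored = restore(&sampled, 0.5);
    // The series above truncates at the last known value (9 elements);
    // this naive version yields 10.
    println!("{:?}", restored);
    let sum: f64 = restored.iter().sum();
    println!("count = {}, sum = {}", restored.len(), sum);
}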
And we also have to consider at least one edge case: metrics with different rates coming at the same time. Nothing stops you from doing something like this:
timer:10|ms|@0.1
timer:10|ms
timer:10|ms|@0.02
It seems that the only real problem here is the timer storage. I think it should be solved by interpreting sampled timers differently from unsampled ones and by storing a special kind of vector processed by separate formulas on aggregation. Such vectors will obviously take more CPU during aggregation of such a metric. Not totally sure, but at first sight, for percentiles it will become O(N/2) instead of O(1) compared to unsampled timers.
This will also mean that sampled and unsampled timers will not be mixable inside one aggregation period. So for the example case above only the first metric will be considered, and the others will be errors because of the sampling rate mismatch.
While it would take some time to program, I think this will give the correctness we want here. The price is that the results will probably not be the same as with the other statsd solutions.
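To illustrate what such a vector could look like, here is a hypothetical sketch (an illustration, not the planned implementation): each sampled value carries a weight of 1/rate, and the aggregation formulas walk cumulative weights instead of plain elements.

// A weighted timer vector: sampled values are stored once, together
// with the number of original values each of them stands for.
struct WeightedTimer {
    samples: Vec<(f64, f64)>, // (value, weight = 1/sampling_rate)
}

impl WeightedTimer {
    fn count(&self) -> f64 {
        self.samples.iter().map(|&(_, w)| w).sum()
    }
    fn sum(&self) -> f64 {
        self.samples.iter().map(|&(v, w)| v * w).sum()
    }
    // Weighted percentile: sort by value, then walk the cumulative weight.
    fn percentile(&self, p: f64) -> Option<f64> {
        let mut sorted = self.samples.clone();
        sorted.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        let threshold = p * self.count();
        let mut acc = 0.0;
        for &(v, w) in &sorted {
            acc += w;
            if acc >= threshold {
                return Some(v);
            }
        }
        sorted.last().map(|&(v, _)| v) // guard against rounding at p = 1.0
    }
}

fn main() {
    // 10, 20, ..., 100 received at rate 0.1: each sample weighs 10.
    let t = WeightedTimer {
        samples: (1..=10).map(|i| ((i * 10) as f64, 10.0)).collect(),
    };
    println!("count = {}, sum = {}", t.count(), t.sum()); // 100, 5500
    println!("p50 = {:?}", t.percentile(0.5)); // Some(50.0)
}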
What do you think?
"And I would prefer the intuitive and correct algorithm here rather than blindly pursuing the compatilbility." I agree with you. If we have the opportunity, it is desirable to restore the data according to the probability theory. And do not pay attention to the original and other statsd implementations.
timer:10|ms|@0.1
timer:10|ms
timer:10|ms|@0.02
I don't see any practical sense in using multiple rates in one aggregation period. It's enough to use the values for only one rate (for example, the largest) and ignore the rest.
I don't know how you calculate all the aggregations. Do you have to restore all 100% of the values to make the calculations correct? Or is it possible to use only the sampled values and multiply count and sum by 1/rate?
Restoring all values is not friendly to memory and CPU, especially without a reasonable goal )
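For reference, a minimal sketch of that cheaper approach (an illustration, assuming only the additive aggregates get scaled):

// Keep only the values that actually arrived; scale the additive
// aggregates by 1/rate instead of materializing restored copies.
fn scaled_count_and_sum(received: &[f64], rate: f64) -> (f64, f64) {
    let factor = 1.0 / rate;
    let count = received.len() as f64 * factor;
    let sum = received.iter().sum::<f64>() * factor;
    (count, sum)
}

fn main() {
    // The ten timer values from the test above, sent with @0.1.
    let received: Vec<f64> = (1..=10).map(|i| (i * 10) as f64).collect();
    let (count, sum) = scaled_count_and_sum(&received, 0.1);
    println!("count = {}, sum = {}", count, sum); // count = 100, sum = 5500
}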
What about the rate aggregate? I mean not the sample rate, but rate as metrics per second.
For a series of 300 values sampled at sample_rate = 0.1 for 30 seconds, do you want to have metric1.rate = 10 or metric1.rate = 1?
rate = sum / interval = 300 / 30 = 10
So you want the "system" rate, the one which is received by bio, not the original one, i.e. the rate on the "source", which in my example would be
sum / interval / sampling_rate = 300 / 30 / 10 = 1
right?
It's not a question of wishes ) I think that all aggregates must be consistent with each other.
For example:
Real events count inside the app: 300
Sample rate: 0.1
Sent/received count: 30
Interval: 30
Aggregates: sum = 30 / 0.1 = 300, rate = sum / 30 = 10
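The same arithmetic as code (a tiny sketch; restored_rate is a hypothetical helper, not bioyino's API):

fn restored_rate(received: f64, sampling_rate: f64, interval: f64) -> f64 {
    let restored_sum = received / sampling_rate; // 30 / 0.1 = 300
    restored_sum / interval                      // 300 / 30 = 10
}

fn main() {
    println!("{}", restored_rate(30.0, 0.1, 30.0)); // 10
}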
And such logic of restoring original values should be used in all other cases.
The only exception is bioyino's inner stats. They must expose counts without any restoring.
Do you agree with me?
So you want the "system" rate, the one which is received by bio, not the original one, i.e. the rate on the "source", which in my example would be
sum / interval / sampling_rate = 300 / 30 / 10 = 1
right?
We need the original rate, which is 10 in the current example. Btw, we can speak Russian to minimize misunderstanding )
Btw, we can speak Russian to minimize misunderstanding )
Not here please. I'd like issues to be world-readable just in case. Any other chat - Telegram or something - you're welcome.
We need the original rate, which is 10 in the current example.
I may have miscalculated this, but that is what I meant. You want the original rate inside the real application, not the rate at which bioyino receives the metrics. I've got it anyway; I thought the same thing here.
Excellent! Thank you for your time! Waiting for the implementation.
Implemented the correct sampling rate counting on the way to 0.8.0. Not sure it will go into 0.7 because of the amount of changes required (storing the sampling rate in memory was removed in 0.7, but it turned out to be needed again, so I had to bring it back with a non-trivial bunch of changes).
Good news: this tag currently runs well in our production, though so far only on cluster servers, not agents. It would be great if you tried the sampling rate counting together with the other features. Please make sure to read the CHANGELOG, since it has many changes incompatible with 0.7.
Thanks for your work! There are some problems with the new version:
1. When multimessage is disabled, only part of the values (from a third to a half) is handled and aggregated:
for i in {1..100}; do echo "test.counter:1|c" | nc -u 127.0.0.1 8825; done
bioyino.ingress-metric; 51 1617572688
test.counter; 51 1617572688
for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
bioyino.ingress-metric; 46 1617573028
test.timer;aggregate=percentile.999 99.955 1617573028
test.timer;aggregate=max 100 1617573028
test.timer;aggregate=percentile.75 78.75 1617573028
test.timer;aggregate=min 1 1617573028
test.timer;aggregate=percentile.98 99.1 1617573028
test.timer;aggregate=median 48.5 1617573028
test.timer;aggregate=count 46 1617573028
test.timer;aggregate=last 99 1617573028
test.timer;aggregate=sum 2432 1617573028
test.timer;aggregate=mean 52.869565217391308 1617573028
test.timer;aggregate=rate 9.2 1617573028
test.timer;aggregate=percentile.99 99.55 1617573028
test.timer;aggregate=percentile.95 97.75 1617573028
Config options:
multimessage = false
buffer-flush-time = 100
buffer-flush-length = 0
All other meaningful settings are set to their defaults from https://raw.githubusercontent.com/avito-tech/bioyino/v0.8.0/config.toml See the attached bioyino-0.8.0.toml.txt.
I tried to push the stuck packets out of the buffer with a periodic metric sent every buffer-flush-time, but without success.
I never saw such behavior before.
2. The last aggregate is not correct and takes different values:
for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
test.timer;aggregate=count 100 1617573493
test.timer;aggregate=percentile.75 75.25 1617573493
test.timer;aggregate=percentile.95 95.05 1617573493
test.timer;aggregate=last 92 1617573493
test.timer;aggregate=rate 20 1617573493
test.timer;aggregate=percentile.999 99.901 1617573493
test.timer;aggregate=percentile.99 99.01 1617573493
test.timer;aggregate=sum 5050 1617573493
test.timer;aggregate=percentile.98 98.02 1617573493
test.timer;aggregate=median 50.5 1617573493
test.timer;aggregate=max 100 1617573493
test.timer;aggregate=mean 50.5 1617573493
test.timer;aggregate=min 1 1617573493
...
for i in {1..100}; do echo "test.timer:$i|ms" | nc -u 127.0.0.1 8825; done
test.timer;aggregate=last 91 1617573643
3. Restored values (count, sum and rate) have too high precision:
echo "test.counter:1|c|@0.1" | nc -u 127.0.0.1 8825
test.counter; 9.99999985098839 1617575824
echo "test.timer:1|ms|@0.1" | nc -u 127.0.0.1 8825
test.timer;aggregate=count 9.99999985098839 1617575969
test.timer;aggregate=rate 1.999999970197678 1617575969
test.timer;aggregate=sum 9.99999985098839 1617575969
Is it possible to decrease it for restored values? I think such precision doesn't carry any meaning, but it uses more space in a TSDB with delta-encoding compression.
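(For reference: the long tail matches what an f32 sampling rate widened to f64 produces, consistent with the float32 note earlier in the thread; a small demonstration, assuming that is indeed the cause:)

fn main() {
    let rate: f32 = 0.1; // stored as ~0.100000001490116...
    let restored = 1.0_f64 / rate as f64;
    println!("{}", restored); // 9.99999985098839
}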
The last aggregate incorrectness is a result of multithreaded aggregation. Like with gauges, there is no way to determine which exact value came last, but it will most probably be one of the last ones. Also notice that even with single-threaded aggregation it would still not be possible to determine this because of, for example, network packet reordering.
I'll look into it, will let you know later. As of now, just for the sake of testing, you can try a trick of "pushing" the rest of the metrics through with another set. For example, send 1000 more metrics with another name.
I tried pushing with another metric. It's not helping.
@Karsonito The UDP problem seems to be fixed, should work as intended. Please try the latest tag
Hello! Finally tested the last fix. No errors found. Under heavy load (500k packets/s, 300k unique metrics) it also works correctly. Thank you for the great work!
I cannot actually see how a TSDB would suffer there. Obviously they store values as 64-bit or 32-bit numbers, so one metric is always 8 or 4 bytes, not depending on the number of characters. Anyway, in the real world you almost always have slightly different results each aggregation period, so a TSDB would much more probably store the diff than just increase the RLE counter.
We are using the VictoriaMetrics TSDB, which uses a slightly modified Gorilla compression: https://faun.pub/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932
I analyzed the compression and found that every 3 digits of precision doubles the size of a record in VM. In monitoring such precision is useless; it's not a finance tool.
That's why it's important to have the ability to decrease the precision of restored values.
Actually, it still may be useful in some edge cases where counting is done with very low values. Anyway, the problem above (with the 32-bit sampling) does not bring precision loss and does not happen because of it.
What I can offer is using 32-bit floats instead of 64-bit. I've pushed a new tag where you can try them.
To enable this feature you need to use the f32 feature flag when compiling:
cargo build --release --features f32
We've never tried to run bioyino with f32s, so I cannot guarantee it will all be correct. It should help you with the precision, because f32 only has about 6 meaningful digits instead of 15. Will be glad to hear how it went and whether it is counting things correctly.
Thanks for the quick solution, but it's not solving the problem.
I'll create a new issue with a request for rounding options and close the current one. Do you agree?
Are you sure you've compiled this with the f32 flag? https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fad3d9d5e6d24cc5596df441ff374bcf
Yes, f32 is enabled:
cargo build --release --features f32 --verbose
I see --cfg 'feature="f32"' added to the rustc commands in the log.
I did all possible cleaning before the build (I'm a newbie in Rust):
rm -rf ~/.cargo
cargo clean
cargo update
It doesn't help:
test.timer;aggregate=count 3.3333332538604738
But it differs from alpha5 without f32:
test.timer;aggregate=count 3.3333332008785727
I think I know where the problem is. I'll try to make it print fewer digits correctly, will let you know when it's ready.
Actually, that was fast. I've changed the float serializer to a more correct one. Try the 0.8.0-alpha6 tag. I now get these for your input:
test.timer;aggregate=min 1.0 1620928444
test.timer;aggregate=percentile.95 1.0 1620928444
test.timer;aggregate=percentile.98 1.0 1620928444
test.timer;aggregate=max 1.0 1620928444
test.timer;aggregate=percentile.999 1.0 1620928444
test.timer;aggregate=mean 1.0 1620928444
test.timer;aggregate=percentile.99 1.0 1620928444
test.timer;aggregate=last 1.0 1620928444
test.timer;aggregate=count 3.3333333 1620928444
test.timer;aggregate=median 1.0 1620928444
test.timer;aggregate=sum 3.3333333 1620928444
test.timer;aggregate=rate 0.6666666 1620928444
test.timer;aggregate=percentile.75 1.0 1620928444
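The effect is what you would get from formatting the f32 value itself with a shortest round-trip serializer instead of widening it to f64 first (an illustration of the symptom, not the actual patch; the exact digits may differ from the reports above depending on how the value was computed):

fn main() {
    let count: f32 = 10.0 / 3.0;  // an f32 aggregate value
    println!("{}", count);        // 3.3333333 (shortest f32 round-trip)
    println!("{}", count as f64); // ~3.3333332538604736 (widened to f64 first)
}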
Thanks. Now it's great
Please merge the last fixes into the 0.8.0 branch.
These tags were on the 0.8.0 branch already, so there is no need to.
Maybe I misunderstood, but I do not see the last commit in the v0.8.0 branch: https://github.com/avito-tech/bioyino/commits/v0.8.0
commit 7a22907461f65fa9c5f15afe2b610cc047d4e1fd
Author: Sergey Noskov <snoskov@avito.ru>
Date: Thu May 13 20:54:53 2021 +0300
Change serializer for floats to more correct one
Did not push it somehow. Now it's there.
FYI, I tested resource consumption for 0.6.0, 0.7.2 and 0.8.0 with counters and timers. For timers, 0.8.0 has the lowest RSS memory peaks (-50%) during aggregation and spends less CPU time (-10-15%). For counters the results are more mixed, and there are no obvious advantages or issues. Thanks for your work one more time!
Hello! If I understand correctly, the sampling rate is not supported the way it is in other statsd implementations. Bioyino does not "restore" the original value in proportion to the sample rate? For example, after sending
metric:1|c|@0.01
I expect to receive metric 100