Open SimonKoetting opened 10 months ago
Pinging @elastic/es-data-management (Team:Data Management)
In short, this is indeed caused by rounding, specifically by the fact that rounding and arithmetic do not play nicely together. Within the ingest pipeline, cumulative time values are stored as nanoseconds. When they are output to /_nodes/stats they are converted to milliseconds using Java's TimeUnit.NANOSECONDS.toMillis, which is just integer division by 1,000,000, so it rounds down to the nearest millisecond. Even though the inner pipeline millis increase monotonically, and the sum of the inner pipeline millis is less than or equal to the outer pipeline's millis, the formula outer - sum(inner) will not increase monotonically because of rounding.
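For illustration, here is a minimal Java sketch of that truncation; it only demonstrates the TimeUnit behavior, not the actual Elasticsearch code path:

```java
import java.util.concurrent.TimeUnit;

public class TruncationDemo {
    public static void main(String[] args) {
        // toMillis is integer division by 1_000_000, so the fractional
        // millisecond is always dropped.
        long nanos = 10_100_000L; // 10.1 ms
        System.out.println(TimeUnit.NANOSECONDS.toMillis(nanos)); // prints 10
    }
}
```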
This is easiest to see with an example. Consider an outer pipeline with a single inner pipeline. If the outer pipeline starts with an actual nanosecond counter value equal to 10.1ms and the inner with an actual value of 9.9ms, these will be rounded to 10ms and 9ms respectively. If the self time of the outer pipeline is computed by subtracting inner millis from outer millis, the computed value will be 1ms, though the actual value is 0.2ms. Now we process a new document: the outer takes 0.3ms, the inner 0.2ms, thus the outer self time is 0.1ms. After the new document the actual outer value is 10.4ms and the inner actual is 10.1ms, thus the actual outer self time is 0.3ms. But the rounded outer and inner values are now both 10ms, causing the computed outer self value to be 0ms.
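Here is a small, self-contained Java sketch that reproduces the numbers above; the computedSelfMillis helper is only an illustration of subtraction-after-rounding, not the actual Beats or Elasticsearch code:

```java
import java.util.concurrent.TimeUnit;

public class SelfTimeRoundingDemo {
    // Self time as derived downstream: rounded outer millis minus rounded inner millis.
    static long computedSelfMillis(long outerNanos, long innerNanos) {
        return TimeUnit.NANOSECONDS.toMillis(outerNanos) - TimeUnit.NANOSECONDS.toMillis(innerNanos);
    }

    public static void main(String[] args) {
        long outer = 10_100_000L; // 10.1 ms actual
        long inner = 9_900_000L;  //  9.9 ms actual
        System.out.println(computedSelfMillis(outer, inner)); // 1 ms (actual self time: 0.2 ms)

        outer += 300_000L; // outer now 10.4 ms actual
        inner += 200_000L; // inner now 10.1 ms actual
        System.out.println(computedSelfMillis(outer, inner)); // 0 ms (actual self time: 0.3 ms) -- it decreased
    }
}
```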
Solutions:
1. Emit the underlying nanosecond values so that self time can be computed without the rounding error.
2. Compute self_time_millis within the ingest stats and emit it directly. Update Beats to use the already computed self_time_millis rather than computing it.

My inclination is to do 2), as nanoseconds won't be useful to most users. In either case, there will still be a bit of weirdness caused by rounding: it will no longer be true that sum(inner) + outerSelf == outer.
After some additional discussion and thought, I do not know if tracking self_time_millis in the ingest pipeline is the best way to achieve this, due to:

> In either case, there will still be a bit of weirdness caused by rounding: it will no longer be true that sum(inner) + outerSelf == outer.

If we emit self_time_millis as described above, in the above example we would start with outer time being 10ms, inner time being 9ms, and outer self time being 0ms. It seems strange to have inner time and outer self time not actually account for all the time in outer time.
The way it is currently computed does not have this problem. The currently computed value of self time makes more sense if thought of as: "the number of whole milliseconds not taken by an inner pipeline". Then it makes sense why it can decrease sometimes: some inner pipeline increased to another whole millisecond without increasing the total by a whole millisecond.
Though this may make sense, it does not fix the need for a metric which is non-decreasing as an argument to counter_rate. @SimonKoetting is there any way to deal with this metric assuming that it can sometimes decrease? Perhaps with a different visualization?
Hey @parkertimmins, I'm not aware of any other way to achieve a histogram visualization without using counter_rate, because if we just did something like total ms / doc count, the graph would become nearly linear after the pipeline has been running for a while :D But as a disclaimer, I'm not a developer, so maybe some Kibana dev has an idea on how to fix that on the "frontend" side.
Hey @SimonKoetting Thanks for giving it some thought. I came up with something that will behave similarly to counter_rate, but for a metric such as this that can occasionally decrease in value.
@slobodanadamovic Since you last changed this metric in https://github.com/elastic/cloud/pull/124989, I'd like to see what you think. For context, my preceding comment explains why we'd prefer to fix this in the dashboard if possible, rather than emitting a new stat from ES.
We could replace
cumulative_sum(counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)))/cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))
with:
cumulative_sum(pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)))) / cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))
Or more specifically, replacing this clause: counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)) with: pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms))).
Here's some example data comparing these metrics:

| data set | values |
|---|---|
| Input Stat | 5 6 7 6 7 1 |
| Metric w/ counter rate | 5 1 1 6 1 1 |
| Metric w/ differences | 5 1 1 0 1 0 |
When there is a decrease in the input stat value, this will output 0 instead of the value of the stat after the decrease, as is done by counter_rate. This can be seen where there are 0's in the bottom row.
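To make the table reproducible, here is a rough Java simulation of the two behaviors; counterRate and clampedDifferences are simplified stand-ins for Lens's counter_rate and pick_max(0, differences(...)), not their actual implementations (the first bucket is passed through unchanged just to match the table):

```java
import java.util.Arrays;

public class RateComparison {
    // Simplified counter_rate: a decrease is treated as a counter reset,
    // so the new value itself is emitted.
    static long[] counterRate(long[] v) {
        long[] out = new long[v.length];
        out[0] = v[0];
        for (int i = 1; i < v.length; i++) {
            out[i] = v[i] >= v[i - 1] ? v[i] - v[i - 1] : v[i];
        }
        return out;
    }

    // Simplified pick_max(0, differences(...)): a decrease is clamped to zero.
    static long[] clampedDifferences(long[] v) {
        long[] out = new long[v.length];
        out[0] = v[0];
        for (int i = 1; i < v.length; i++) {
            out[i] = Math.max(0, v[i] - v[i - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] input = {5, 6, 7, 6, 7, 1};
        System.out.println(Arrays.toString(counterRate(input)));        // [5, 1, 1, 6, 1, 1]
        System.out.println(Arrays.toString(clampedDifferences(input))); // [5, 1, 1, 0, 1, 0]
    }
}
```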
Unfortunately, this has a small bug. If the large decrease between stat values 7 and 1 is due to a node restart, 1 should be the output rather than 0. But node restarts are relatively rare, so I do not think this is a big problem. An alternative way to deal with this is to also add an ifelse to the denominator, and thus not include the counts in the average computation when there is a decrease:
cumulative_sum(pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)))) / cumulative_sum(ifelse(differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)) < 0, 0,counter_rate((max(elasticsearch.ingest_pipeline.total.count)))))
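For illustration, here is a rough numeric sketch of what the ifelse in the denominator changes, using made-up per-interval deltas: intervals where the raw stat decreased still add their document counts to the plain denominator, but are excluded by the ifelse variant, so the resulting average is not dragged down by intervals that contributed zero milliseconds:

```java
public class AverageComparison {
    public static void main(String[] args) {
        // Hypothetical per-interval values: millisecond deltas after
        // pick_max(0, differences(...)) and document-count deltas from counter_rate(count).
        long[] selfMsDelta = {5, 1, 1, 0, 1, 0};  // 0 where the raw stat decreased
        long[] docDelta = {10, 2, 2, 2, 2, 2};
        boolean[] statDecreased = {false, false, false, true, false, true};

        long msSum = 0, docSumPlain = 0, docSumFiltered = 0;
        for (int i = 0; i < selfMsDelta.length; i++) {
            msSum += selfMsDelta[i];
            docSumPlain += docDelta[i];
            if (!statDecreased[i]) {
                docSumFiltered += docDelta[i]; // the ifelse drops these intervals
            }
        }
        System.out.printf("without ifelse: %.3f ms/doc%n", (double) msSum / docSumPlain);    // 0.400
        System.out.printf("with ifelse:    %.3f ms/doc%n", (double) msSum / docSumFiltered); // 0.500
    }
}
```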
What do y'all think?
Seems like an easy, even if not perfect, improvement/fix for this issue. Especially with the if-else included, this should lead to pretty accurate numbers; we would then just need to update all visualizations on the dashboard to use this "new" way of calculating.
+1 from my side
Sounds good from my side. Thanks for looking into this. I will take time to update our monitoring dashboards with the proposal and let you know in case I have any followup questions.
Elasticsearch Version
8.11.3
Installed Plugins
No response
Java Version
bundled
OS Version
Cloud
Problem Description
Looking at the integrated [Elasticsearch] Ingest Pipeline dashboard, I have multiple times (meaning on multiple different clusters) found "slowest pipelines" with > 100ms (which seemed definitely too much to me) that also do not match the Average time per doc line chart on the same dashboard. Some examples (different clusters):
Sometimes, if I change the time range just slightly, the numbers change completely (by 1000x in some cases), so there is clearly something wrong with the numbers in the Slowest pipelines table visualization on the prebuilt dashboard.
My Analysis so far
The table calculates its metric with the following formula:
cumulative_sum(counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)))/cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))
Breaking this down a bit to identify the problem, I ended up just using the counter_rate of time.self.ms, which clearly shows some spikes. Note: this screenshot and all following ones are from a single-node cluster, so we don't have to worry about different stats from different nodes, but I verified the same behavior on a cluster with multiple nodes doing ingest.
Looking at the documentation of counter_rate:
"If the value does get smaller, it will interpret this as a counter reset." is the interesting part here. If we take a look at the data in Discover for the last spike in the screenshot above, we can see that the metric "elasticsearch.ingest_pipeline.total.time.self.ms" decreased by a millisecond. As the metric is meant to be the time spent processing documents through this pipeline, this shouldn't be possible: as new documents go through the pipeline, the time has to increase. Also visible in the screenshot is that "elasticsearch.ingest_pipeline.total.time.total.ms" kept the same value. If I change the Lens visualization from above to use "elasticsearch.ingest_pipeline.total.time.total.ms", I so far couldn't find any "spike"/weird-looking data.
I found out that the calculation of "elasticsearch.ingest_pipeline.total.time.self.ms" is basically just "elasticsearch.ingest_pipeline.total.time.total.ms" minus the "elasticsearch.ingest_pipeline.processor.time.total.ms" of the "pipeline" processors in the pipeline. The corresponding code can be found on GitHub around line 96 and line 128.
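As a hypothetical Java sketch of that derivation (the names and types here are mine; the actual computation lives in the Beats code linked above):

```java
import java.util.List;

public class SelfTimeSketch {
    record ProcessorStat(String type, long timeTotalMillis) {}

    // Pipeline self time: the pipeline's total time minus the time spent
    // in its "pipeline" processors (i.e. inner pipelines).
    static long selfTimeMillis(long pipelineTotalMillis, List<ProcessorStat> processors) {
        long innerPipelineMillis = processors.stream()
                .filter(p -> "pipeline".equals(p.type()))
                .mapToLong(ProcessorStat::timeTotalMillis)
                .sum();
        return pipelineTotalMillis - innerPipelineMillis;
    }

    public static void main(String[] args) {
        List<ProcessorStat> processors = List.of(
                new ProcessorStat("set", 3),      // regular processor, ignored
                new ProcessorStat("pipeline", 9)  // inner pipeline, subtracted
        );
        System.out.println(selfTimeMillis(10, processors)); // 10 - 9 = 1
    }
}
```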
Looking again at the pipeline stats in Discover, but this time for the individual processors rather than the whole pipeline, at the same date range shows that the "elasticsearch.ingest_pipeline.processor.time.total.ms" of the pipeline processor increased by 1ms in the timeframe, while "elasticsearch.ingest_pipeline.total.time.total.ms" from the screenshot above kept the same value (530ms in this case).
My conclusion so far
I guess there is some kind of weird (maybe rounding/conversion?) problem in the collection of the metrics "elasticsearch.ingest_pipeline.total.time.total.ms" and "elasticsearch.ingest_pipeline.processor.time.total.ms". Based on that, all the other behavior, like a decreasing "elasticsearch.ingest_pipeline.total.time.self.ms" and therefore the Lens visualization's counter_rate detecting a reset and providing "wrong" data on the dashboard, is deterministic and makes sense to me.
Used Setup
I saw this problem/behavior on multiple different clusters at customers. To reproduce/test, I used a simple 1-instance Elastic Cloud deployment with a few enrolled Fleet agents, without any special configuration or huge amounts of data. I used multiple Elasticsearch versions, including 8.11.3.
Steps to Reproduce
Set up a cluster, enable Stack Monitoring, and wait :)
Logs (if relevant)
No response