elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Ingest Pipeline stats sometimes report wrong values, causing the integrated Ingest Pipeline Dashboard to show wrong statistics #103955

Open SimonKoetting opened 8 months ago

SimonKoetting commented 8 months ago

Elasticsearch Version

8.11.3

Installed Plugins

No response

Java Version

bundled

OS Version

Cloud

Problem Description

Looking at the integrated [Elasticsearch] Ingest Pipeline Dashboard, I have found, multiple times (meaning on multiple different clusters), "slowest pipelines" with > 100ms (which seemed definitely too high to me) that also did not match the Average time per doc line chart on the same dashboard. Some examples (different clusters):

Pasted image 20240104104842

Pasted image 20240104104928

Since the numbers sometimes change completely (sometimes by 1000x) if I change the time range just slightly, there is clearly something wrong with the numbers in the Slowest pipelines table visualization on the prebuilt dashboard.

My Analysis so far

The Table calculates its metric with the following formula:

cumulative_sum(counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)))/cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))
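For orientation, the intent of that formula is an average self time per ingested document over the selected range: the rate of the cumulative self-time counter divided by the rate of the cumulative document counter. Here is a minimal sketch of that idea (my own illustration with made-up counter snapshots, not the Lens implementation):

```java
// Illustration only: average self time per document between two samples
// of the cumulative counters (all values here are hypothetical).
class AvgSelfTimeSketch {
    public static void main(String[] args) {
        long selfMsStart = 1_000, selfMsEnd = 1_250;    // total.time.self.ms samples
        long countStart  = 50_000, countEnd = 75_000;   // total.count samples

        double avgMsPerDoc =
            (double) (selfMsEnd - selfMsStart) / (countEnd - countStart);
        System.out.println(avgMsPerDoc + " ms/doc");    // 0.01 ms/doc
    }
}
```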

Breaking this down a bit to identify the problem, I ended up using just the counter_rate of time.self.ms, which clearly shows some spikes: Pasted image 20240104105521 Note: this screenshot and all following ones are from a single-node cluster, so there is no need to worry about different stats from different nodes, but I verified the same behavior on a cluster with multiple nodes doing ingest.

Looking at the documentation of the counter_rate:

Calculates the rate of an ever increasing counter. This function will only yield helpful results on counter metric fields which contain a measurement of some kind monotonically growing over time. If the value does get smaller, it will interpret this as a counter reset. To get most precise results, counter_rate should be calculated on the max of a field.

"If the value does get smaller, it will interpret this as a counter reset." is the interesting part in here, if we take a look at the data in Discover for the last spike in the Screenshot above we can see the metric "elasticsearch.ingest_pipeline.total.time.self.ms" decreased by a ms. Pasted image 20240104110140 As the metrics is meant to be Time spent processing documents through this pipeline, this isn't possible as new documents going thru the pipeline the time has to increase. Also in the screenshot visible, is the "elasticsearch.ingest_pipeline.total.time.total.ms" kept the same value. If I changed the lens visualisation from above to use the "elasticsearch.ingest_pipeline.total.time.total.ms", so far I couldn't find any "spike"/weird looking data.

I found out that "elasticsearch.ingest_pipeline.total.time.self.ms" is basically just calculated as "elasticsearch.ingest_pipeline.total.time.total.ms" minus the "elasticsearch.ingest_pipeline.processor.time.total.ms" of the "pipeline" processors in the pipeline. The corresponding code can be found on GitHub around line 96 and line 128.
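In other words, the collector appears to derive self time roughly like this (my own sketch with made-up values, not the actual Beats code):

```java
// Sketch: self time = pipeline total time minus the time reported by its
// nested processors of type "pipeline", all values already in millis.
class SelfTimeSketch {
    public static void main(String[] args) {
        long pipelineTotalMs = 530;            // total.time.total.ms
        long[] nestedPipelineMs = {200, 120};  // processor.time.total.ms of "pipeline" processors

        long selfMs = pipelineTotalMs;
        for (long inner : nestedPipelineMs) {
            selfMs -= inner;                   // subtract each nested pipeline's share
        }
        System.out.println(selfMs + " ms");    // would be reported as total.time.self.ms
    }
}
```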

Looking again at the pipeline stats in Discover, but this time for the individual processors instead of the whole pipeline, over the same date range, shows that the "elasticsearch.ingest_pipeline.processor.time.total.ms" of the pipeline processor increased by 1ms in that timeframe while the "elasticsearch.ingest_pipeline.total.time.total.ms" from the screenshot above stayed the same (530ms in this case). Pasted image 20240104111653

My conclusion so far

I guess there is some kind of (maybe rounding/conversion?) problem in the collection of the metrics "elasticsearch.ingest_pipeline.total.time.total.ms" and "elasticsearch.ingest_pipeline.processor.time.total.ms". Given that, all the other behavior is deterministic and makes sense to me: "elasticsearch.ingest_pipeline.total.time.self.ms" can decrease, counter_rate in the Lens visualizations then detects a reset, and the dashboard ends up showing "wrong" data.

Used Setup

I saw this problem/behavior on multiple different clusters at customers. To reproduce/test, I used a simple 1-instance Elastic Cloud deployment with a few enrolled Fleet Agents, without any special configuration or huge amounts of data, on multiple Elasticsearch versions, including 8.11.3.

Steps to Reproduce

Set up a cluster, enable Stack Monitoring and wait :)

Logs (if relevant)

No response

elasticsearchmachine commented 8 months ago

Pinging @elastic/es-data-management (Team:Data Management)

parkertimmins commented 4 months ago

In short, this is indeed caused by rounding; specifically, rounding and arithmetic do not play nicely together.

Within the ingest pipeline, cumulative time values are stored as nanoseconds. When they are output to /_nodes/stats they are converted to milliseconds using Java's TimeUnit.NANOSECONDS.toMillis. This is just integer division by 1,000,000, so it rounds down to the nearest millisecond.

Even though the inner pipeline millis increase monotonically, and the sum of the inner pipeline millis is less than or equal to the outer pipeline's millis, due to rounding the formula outer - sum(inner) will not increase monotonically.

This is easiest to see with an example. Consider an outer pipeline with a single inner pipeline. If the outer pipeline starts with an actual nanosecond counter value equal to 10.1ms and the inner with an actual value of 9.9ms, these will be rounded to 10ms and 9ms respectively. If the self time of the outer pipeline is computed by subtracting inner millis from outer millis, the computed value will be 1ms, though the actual value is 0.2ms. Now we process a new document: the outer takes 0.3ms and the inner 0.2ms, so this document's outer self time is 0.1ms. After the new document, the actual outer value is 10.4ms and the actual inner value is 10.1ms, so the actual outer self value is 0.3ms. But the rounded outer and inner values are now both 10ms, causing the computed outer self value to be 0ms.
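That example can be reproduced directly with TimeUnit.NANOSECONDS.toMillis. This small sketch (my own, using the same hypothetical values) shows the computed self time dropping from 1ms to 0ms even though the actual self time grew:

```java
import java.util.concurrent.TimeUnit;

// Reproduces the worked example: actual self time grows, but the
// millis-rounded "outer - inner" computation decreases.
class RoundingSketch {
    public static void main(String[] args) {
        long outerNanos = 10_100_000; // 10.1 ms
        long innerNanos =  9_900_000; //  9.9 ms
        printSelf(outerNanos, innerNanos);   // computed self = 1 ms (actual 0.2 ms)

        // Process one more document: outer +0.3 ms, inner +0.2 ms.
        outerNanos += 300_000;
        innerNanos += 200_000;
        printSelf(outerNanos, innerNanos);   // computed self = 0 ms (actual 0.3 ms)
    }

    static void printSelf(long outerNanos, long innerNanos) {
        long outerMs = TimeUnit.NANOSECONDS.toMillis(outerNanos); // floor division by 1,000,000
        long innerMs = TimeUnit.NANOSECONDS.toMillis(innerNanos);
        System.out.println("computed self = " + (outerMs - innerMs) + " ms");
    }
}
```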

Solutions:

  1. Emit nanoseconds directly for pipeline time in the node stats API. Update Beats to compute self time from nanos and do the rounding.
  2. Compute self time from the nanosecond counter, round this to millis, and emit it in node stats as self_time_millis. Update Beats to use the already computed self_time_millis directly rather than computing it.

My inclination is to do 2) as nanoseconds won't be useful to most users. In either case, there will still be a bit of weirdness caused by rounding: it will no longer be true that sum(inner) + outerSelf == outer.

parkertimmins commented 4 months ago

After some additional discussion and thought, I do not know if tracking self_time_millis in the ingest pipeline is the best way to achieve this, due to:

In either case, there will still be a bit of weirdness caused by rounding: it will no longer be true that sum(inner) + outerSelf == outer.

If we emit self_time_millis as described above, in the above example we would start with outer time being 10ms, inner time being 9ms, and outer self time being 0ms. It seems strange to have inner time and outer self time not actually account for all the time in outer time.
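For illustration, a rough sketch of what computing self_time_millis that way could look like (my own illustration with the example's values, not the actual node stats code), which makes the mismatch visible:

```java
import java.util.concurrent.TimeUnit;

// Option 2 sketch: subtract in nanos first, round to millis once.
class SelfTimeMillisSketch {
    public static void main(String[] args) {
        long outerNanos = 10_100_000; // 10.1 ms total pipeline time
        long innerNanos =  9_900_000; //  9.9 ms spent in nested pipelines

        long selfNanos  = outerNanos - innerNanos;                  // monotonic, exact in nanos
        long selfMillis = TimeUnit.NANOSECONDS.toMillis(selfNanos); // proposed self_time_millis
        System.out.println(selfMillis); // 0, while outer=10 and inner=9, so inner + self != outer
    }
}
```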

The way it is currently computed does not have this problem. The currently computed value of self time makes more sense if thought of as: "the number of whole milliseconds not taken by an inner pipeline". Then it makes sense why it can decrease sometimes: some inner pipeline increased to another whole millisecond without increasing the total by a whole millisecond.

Though this may make sense, it does not remove the need for a metric that is non-decreasing as an argument to counter_rate. @SimonKoetting is there any way to deal with this metric given that it can sometimes decrease? Perhaps with a different visualization?

SimonKoetting commented 4 months ago

Hey @parkertimmins, I'm not aware of any other way to achieve a histogram visualization without using counter_rate, since if we just did something like totalms/doccount the graph would become nearly linear once the pipeline has been running for a while :D As a disclaimer, I'm not a developer, so maybe some Kibana devs have an idea of how to fix that on the "frontend" side.

parkertimmins commented 4 months ago

Hey @SimonKoetting, thanks for giving it some thought. I came up with something that behaves similarly to counter_rate but works for a metric such as this one, which can occasionally decrease in value.

@slobodanadamovic Since you last changed this metric in https://github.com/elastic/cloud/pull/124989, I'd like to see what you think. For context, my preceding comment explains why we'd prefer to fix this in the dashboard if possible, rather than emitting a new stat from ES.

We could replace:

cumulative_sum(counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)))/cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))

with:

cumulative_sum(pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)))) / cumulative_sum(counter_rate((max(elasticsearch.ingest_pipeline.total.count))))

Or more specifically, replacing this clause: counter_rate(max(elasticsearch.ingest_pipeline.total.time.self.ms)) with: pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms))).

Here's some example data comparing these metrics:

| Data set values | | | | | | |
|---|---|---|---|---|---|---|
| Input stat | 5 | 6 | 7 | 6 | 7 | 1 |
| Metric w/ counter rate | 5 | 1 | 1 | 6 | 1 | 1 |
| Metric w/ differences | 5 | 1 | 1 | 0 | 1 | 0 |
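As a sanity check, here is a small sketch (my own, not Lens internals) that reproduces the two bottom rows of the table from the input series, treating the first bucket as a difference from zero:

```java
// Reproduces the table above: counter_rate-style vs pick_max(0, differences(...)).
class BucketMetricSketch {
    public static void main(String[] args) {
        int[] stat = {5, 6, 7, 6, 7, 1};

        for (int i = 0; i < stat.length; i++) {
            int prev = (i == 0) ? 0 : stat[i - 1]; // assume the series starts from 0
            int delta = stat[i] - prev;

            // counter_rate: a decrease is treated as a counter reset,
            // so the current value itself becomes the bucket's rate.
            int counterRate = (delta >= 0) ? delta : stat[i];

            // pick_max(0, differences(...)): a decrease is simply clamped to 0.
            int clampedDiff = Math.max(0, delta);

            System.out.printf("stat=%d counter_rate=%d differences=%d%n",
                    stat[i], counterRate, clampedDiff);
        }
    }
}
```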

When there is a decrease in the input stat value, this will output 0 instead of the value of the stat after the decrease, as is done by counter_rate. This can be seen where there are 0's in the bottom row.

Unfortunately, this does have a small bug. If the large decrease between stat values 7 and 1 is due to a node restart, 1 should be the output rather than 0. But node restarts are relatively rare, so I do not think this is a big problem. An alternative way to deal with this is to also add an ifelse to the denominator, and thus not include the counts in the average computation when there is a decrease:

cumulative_sum(pick_max(0, differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)))) / cumulative_sum(ifelse(differences(max(elasticsearch.ingest_pipeline.total.time.self.ms)) < 0, 0,counter_rate((max(elasticsearch.ingest_pipeline.total.count)))))
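Roughly, the idea of the ifelse variant is: whenever the self-time counter decreased in a bucket (so its numerator contribution was clamped to 0), contribute 0 to the document-count denominator as well, so those buckets do not drag the average down. A small sketch of that idea (my own, with hypothetical per-bucket values, not Lens internals):

```java
// Sketch of the ifelse variant: buckets where self time decreased
// contribute neither time nor documents to the cumulative average.
class IfElseSketch {
    public static void main(String[] args) {
        long[] selfMs = {100, 110, 109, 130};          // hypothetical max(total.time.self.ms) per bucket
        long[] count  = {1_000, 2_000, 3_000, 4_000};  // hypothetical max(total.count) per bucket

        long timeSum = 0, docSum = 0;
        for (int i = 1; i < selfMs.length; i++) {
            long timeDelta = selfMs[i] - selfMs[i - 1];
            long docDelta  = count[i] - count[i - 1];
            if (timeDelta < 0) {
                continue;  // decrease caused by rounding: skip this bucket entirely
            }
            timeSum += timeDelta;
            docSum  += docDelta;
        }
        System.out.println((double) timeSum / docSum + " ms/doc"); // 0.0155
    }
}
```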

What do y'all think?

SimonKoetting commented 3 months ago

Seems like an easy, even if not perfect, improvement/fix for this issue. Especially with the if-else included, this should lead to pretty accurate numbers here; we would then just need to update all visualizations on the dashboard to use this "new" way of calculating.

+1 from my Side

slobodanadamovic commented 3 months ago

Sounds good from my side. Thanks for looking into this. I will take some time to update our monitoring dashboards with the proposal and let you know in case I have any follow-up questions.