Closed: abraxas81 closed this issue 2 months ago
`period` is an undocumented setting for the statsd module; it controls how often the metric is reported (flushed) over its time-to-live window `ttl`.
When I sent a single UDP message to Metricbeat configured with the ES output, I noticed that documents are generated every `period` seconds (default 10s) over the time window defined by the `ttl` setting (default 30s). In the corner case of `period: 1s` and `ttl: 0s`, the same document is generated and indexed in ES every second (with an incrementing `@timestamp`).
Regarding the reported issue, it looks like the `count` is only reset when the metric reaches its time-to-live defined by `ttl`. The workaround I found is to set `ttl` and `period` to something smaller than the interval in the while-loop (e.g. `1s`). This allows the metric to reach its ttl and reset the count value. Be careful with a small `period`, as it can generate additional load on the ES cluster (depending on the number of metrics being monitored).
I agree that the `count` and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot tell whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behavior described above, where the metric is reported every `period` over the `ttl` window.
Here are the key findings:
Reproducing and Observing the Behaviour
Detailed Analysis
Metric Sending:
```shell
while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
```
Metric Updates and Staleness
Because `processing_item` is sent every 3 seconds, each new data point updates the `lastSeen` timestamp for this metric, preventing it from being considered stale. If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, they reset the count value for the timer metric type.
Aggregation and Metrics in StatsD
The count value just keeps growing instead of being reset; it is reset only when there is some inactivity between flushes.
If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.
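The ttl-based behaviour described above can be sketched with a small simulation. This is illustrative Python, not Metricbeat source; the `period`/`ttl` handling and the stale check are modelled after what is observed in this thread:

```python
# Illustrative simulation (not Metricbeat code) of the reported behaviour:
# the count accumulates across flushes and is only cleared once the metric
# has been idle for longer than ttl.

def simulate(event_times, period=10, ttl=30, horizon=60):
    """event_times: set of seconds at which one sample arrives."""
    count = 0
    last_seen = None
    flushed = []
    for t in range(horizon + 1):
        if t in event_times:
            count += 1
            last_seen = t
        if t > 0 and t % period == 0:
            # staleness check instead of reset-on-flush
            if last_seen is not None and t - last_seen > ttl:
                count = 0
                last_seen = None
            flushed.append(count)
    return flushed

# One sample every 3s keeps the metric fresh, so the count never resets:
print(simulate(event_times=set(range(0, 61, 3))))
# → [4, 7, 11, 14, 17, 21]
```

With the 3-second sender from the reproduction loop, every flush reports a strictly growing count, matching the monotone values observed in ES.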
To better understand the behaviour, adjust the following settings:
- Interval frequency: how often you send metrics.
- `period`: how often metrics are fetched from the server.
- `ttl`: how long metrics remain valid.

Also, try sending different metrics and observe the count values.
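For experimenting with these knobs, a minimal Metricbeat statsd module configuration might look like the following (values are illustrative; `ttl` is the undocumented setting discussed above):

```yaml
- module: statsd
  metricsets: ["server"]
  host: "localhost"
  port: 8125
  period: 10s   # how often collected metrics are flushed to the output
  ttl: 30s      # how long a metric stays "fresh" without new samples
```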
I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above. However, feel free to reopen it if further discussion is needed or if you believe something was missed in the analysis.
@immon thanks for the analysis!
In the corner case of `period: 1s` and `ttl: 0s`, the same document is generated and indexed in ES every second (with an incrementing `@timestamp`).
- We tested this setup with `ttl: 0`, but as the documentation suggests, a ttl of zero means metrics will never expire, and this makes the bug even worse, not better.
The workaround I found is to set `ttl` and `period` to something smaller than the interval in the while-loop (e.g. `1s`).
- This might be OK for some smaller systems, but our traffic is heavy, and a period of less than 30s generates far too much data in ES. The main problem is not the way documents are created and when; it's their values, which are wrong. That problem persists no matter what ttl and period are set.
I agree that the `count` and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot tell whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behavior described above, where the metric is reported every `period` over the `ttl` window.
- I'm trying to grasp why ttl exists in the first place for this use-case. The "original" statsd package linked in the initial bug report resets all values at every flush. Why doesn't this module work the same way? We know when the period expires: flush all collected/calculated metrics at that point and reset all the values for the newly collected metrics. As far as I can see, ttl is obsolete and only adds complexity with no benefit. Removing the ttl parameter also removes the "staleness" problem, and there are no more "update" or "live" metrics.
@ritalwar thank you too for the detailed analysis!
No TTL in StatsD: StatsD does not use TTL. Once metrics are flushed, they are reset and no longer counted for the next interval unless new metrics arrive.
- Exactly! This Metricbeat statsd module should behave the same, don't you think? There are then no "stale" metrics, and when the period is reached, all metrics are flushed and all data is reset.
If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.
- I think this is the main cause of this weird behavior. We shouldn't track whether the metric is stale or not. Simply: if the period has not passed yet, the metric is relevant and the count should update, and that is it. When the period passes, metrics are flushed and all values are reset. There is no need for ttl.
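The reset-on-flush semantics proposed here (and implemented by statsd itself) can be sketched as follows; `TimerMetric` is a hypothetical name for illustration, not part of Metricbeat:

```python
# Hypothetical sketch of reset-on-flush: every flush reports the samples
# gathered during the elapsed period and then clears the state entirely.
# No ttl or staleness tracking is needed.

class TimerMetric:
    def __init__(self):
        self.samples = []

    def observe(self, value_ms):
        self.samples.append(value_ms)

    def flush(self):
        samples, self.samples = self.samples, []
        if not samples:
            return None  # nothing seen this period
        return {
            "count": len(samples),
            "sum": sum(samples),
            "min": min(samples),
            "max": max(samples),
        }

m = TimerMetric()
m.observe(3000)
m.observe(3000)
print(m.flush())  # count is 2 for this period only
print(m.flush())  # → None: state was reset, nothing carries over
```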
But if you send different metrics (e.g., `processing_item1`, `processing_item2`, `processing_item3`)
- The problem is not sending different metrics; we have about a hundred different ones. The problem is the values in the "timer" and "histogram" type metrics, which are not reset on every flush (when the period expires). This makes graphs plotted for the duration or count of events based on these metrics inaccurate.
To better understand the behavior: Adjust the following settings: Interval Frequency: How often you send metrics. Period: How often metrics are fetched from the server. TTL: How long metrics remain valid. Also, try sending different metrics and observe the count values.
I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above.
- Well, this looks odd: we demonstrated that the values are wrong and "agreed" that the count and all statistics should reset every flush. If this can't be fixed, maybe the solution is to update the official docs and remove "Timer" and "Histogram" from the list of supported metric types? This would prevent confusion for potential users.
Thank you for your help!
I have the same issues as described above regarding timer metrics.
Since no `total` is stored in ES, it should be calculated from `avg` and `count`. That is not possible, since `count` is not reset on every flush but is rather a monotonically increasing value until the metric becomes stale. When displaying data on a chart with statsd metrics, one could use `sum` to show the total number of counts, but with Beats one has to use `differences`. That causes a few issues. Once a metric becomes stale and is reset, the next value is negative. The other issue is that `differences` requires two data points to be calculated, so if only one metric is recorded within some time frame, `differences` displays no value because there is no other value to calculate from.
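Both `differences` pitfalls can be shown with a toy calculation (illustrative Python; the numbers are made up):

```python
# Toy illustration of taking differences of a cumulative count:
# it breaks when the metric goes stale and the count resets to 0,
# and it yields nothing when there is only a single data point.

def differences(values):
    return [b - a for a, b in zip(values, values[1:])]

cumulative = [4, 7, 11, 14, 0, 3]   # count reset to 0 after staleness
print(differences(cumulative))      # → [3, 4, 3, -14, 3]: negative spike at the reset
print(differences([5]))             # → []: a lone data point produces no value
```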
So my questions are:
I've created a backlog issue for us to analyse how we can handle timer and histogram metrics with respect to ttl; just removing ttl might not be the best solution, since some users might need it to track metrics over longer periods. I'll discuss the priority and take it from there.
@ritalwar In the meantime, could you please answer my two questions?
I really need this functionality: how can I display the number of received timer metrics and the total time consumed?
For example, if in the first 60 seconds I send two timer metrics,
10ms and 15ms,
and in the next 60 seconds I send another two,
20ms and 25ms,
how can I display this in Kibana (when, for example, aggregating over a period of 15 minutes) to get the values
4 (count) and 70ms (total duration)?
And what if after 20 minutes the same thing happens?
I should then have on my chart two points of value 4 and two points of value 70ms.
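Assuming each document carried per-period values (i.e. reset on flush), the arithmetic for this example would be straightforward (illustrative Python; the field names are made up):

```python
# Two 60s windows with two timer samples each; if documents carried
# a per-period count and sum, a 15-minute Kibana bucket would just sum them.

docs = [
    {"count": 2, "sum_ms": 10 + 15},  # first 60s window
    {"count": 2, "sum_ms": 20 + 25},  # next 60s window
]

total_count = sum(d["count"] for d in docs)
total_ms = sum(d["sum_ms"] for d in docs)
print(total_count, total_ms)  # → 4 70
```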
Hi, I would like to report an issue with the Metricbeat statsd module.
When flushing timer and histogram metric types, the count value just keeps growing instead of being reset; it is reset only when there is some inactivity between flushes.
The problem occurs with the timer and histogram metric types for the count value.
If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, they reset the count value for the timer metric type.
Configuration
```shell
while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
```