elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat – Statsd Module issue with count value when using timer/histogram metric type #39987

Closed abraxas81 closed 2 months ago

abraxas81 commented 5 months ago

Hi, I would like to report an issue with the Metricbeat Statsd module.

When flushing the timer and histogram metric types, the count value just keeps growing instead of being reset; it is only reset when there is some inactivity between flushes.

The problem affects the count value of the timer and histogram metric types.

If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, statsd resets the count value for the timer metric type on every flush.

Configuration

  1. Enable the statsd Metricbeat module
    # statsd.yml
    - module: statsd
      metricsets: ["server"]
      host: "localhost"
      port: "8125"
      enabled: true
      period: 10s
  2. Output to the console
    # metricbeat.yml
    output.console:
      pretty: true
  3. Send some example metrics:
    while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
  4. Example output
{
  "@timestamp": "2024-06-04T13:58:55.047Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "median": 3000,
      "p75": 3000,
      "p95": 3000,
      "p99_9": 3000,
      "stddev": 0,
      "p99": 3000,
      "count": 4,
      "min": 3000,
      "max": 3000,
      "mean": 3000
    }
  },
}
{
  "@timestamp": "2024-06-04T13:59:05.047Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "mean": 3000,
      "p95": 3000,
      "median": 3000,
      "p99": 3000,
      "count": 7, #expected 3
      "max": 3000,
      "p75": 3000,
      "p99_9": 3000,
      "min": 3000,
      "stddev": 0
    }
  },
}
{
  "@timestamp": "2024-06-04T13:59:15.046Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "mean": 3000,
      "p75": 3000,
      "count": 10, #expected 3
      "stddev": 0,
      "p99_9": 3000,
      "median": 3000,
      "p95": 3000,
      "min": 3000,
      "max": 3000,
      "p99": 3000
    }
  }
}
  5. We got counts 4, 7, 10 instead of the expected 4, 3, 3 (a timeline sketch of the expected behaviour follows below).
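For reference, here is a rough timeline of the expected counts, assuming the first datagram is sent at t=0 and a flush happens every 10 seconds:

```
sends:    0s  3s  6s  9s | 12s 15s 18s | 21s 24s 27s | ...
flushes:  t=10s: count 4 | t=20s: count 3 | t=30s: count 3 | ...
```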
immon commented 4 months ago

period is an undocumented setting of the statsd module, but it allows you to control how often the metric is reported (flushed) over its time-to-live window (ttl).

When I sent 1 UDP message to Metricbeat configured with the ES output, I noticed that documents are generated every period seconds (default 10s) over the time window defined by the ttl setting (default 30s). In the corner case with period: 1s and ttl: 0s, the same document is generated and indexed in ES every second (with an incrementing @timestamp).

Regarding the reported issue, it looks like the count is only reset when the metric reaches its time-to-live defined by ttl. The workaround I found is to set ttl and period to something smaller than the sleep interval in the while-loop (e.g. 1s). This allows the metric to reach its ttl and reset the count value. Be careful with a small period, as it can generate additional load on the ES cluster (depending on the number of metrics being monitored). A sketch of the workaround follows.
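A minimal sketch of that workaround, reusing the statsd.yml from the reproduction steps above (the 1s values are illustrative; they only need to be smaller than the 3s send interval):

```yaml
# statsd.yml: workaround sketch, let the metric expire between sends
- module: statsd
  metricsets: ["server"]
  host: "localhost"
  port: "8125"
  enabled: true
  period: 1s   # flush every second
  ttl: 1s      # a metric not updated within 1s goes stale, resetting its count
```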

I agree that the count and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot differentiate whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behaviour described above, where a metric is reported every period over the ttl window.


ritalwar commented 2 months ago

Here are the key findings:

Reproducing and Observing the Behaviour

Detailed Analysis

  1. Metric Sending:

    • Here we are sending a processing_item histogram metric every 3 seconds using this script:
      while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
  2. Period (10 seconds):
    • Metrics are collected and aggregated every 10 seconds. Given that metrics arrive every 3 seconds, there will be multiple metrics (3-4) within each 10-second period.
      • First flush: at 10 seconds (may contain 4 metrics)
      • Second flush: at 20 seconds (may contain 3 metrics)
      • And so on, for each 10-second interval.
  3. TTL (default 30 seconds):
    • TTL (time to live) is the maximum duration a metric can stay in the system without being updated before it is considered stale and discarded.
    • With a TTL of 30 seconds, any metric not updated within 30 seconds is considered stale.
    • Since metrics are sent every 3 seconds, each metric is updated before the TTL expires and never goes stale.
    • The default is 30 seconds, but you can configure it in the same file as the period, as sketched below.
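A sketch of setting ttl alongside period in the module configuration (the values shown are the defaults discussed above):

```yaml
# statsd.yml: ttl configured alongside period
- module: statsd
  metricsets: ["server"]
  host: "localhost"
  port: "8125"
  enabled: true
  period: 10s   # how often aggregated metrics are flushed
  ttl: 30s      # how long a metric may go without updates before it is dropped as stale
```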

Metric Updates and Staleness

If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, statsd resets the count value for the timer metric type on every flush.

Aggregation and Metrics in StatsD

the count value just keeps growing instead of being reset; it is only reset when there is some inactivity between flushes.

If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.

To better understand the behaviour, adjust the following settings: interval frequency (how often you send metrics), period (how often metrics are fetched from the server), and TTL (how long metrics remain valid). Also, try sending different metrics and observe the count values. One way to observe the reset is sketched below.
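For example, a minimal sketch (assuming the default ttl of 30 seconds; the 35s pause is illustrative, chosen only to exceed the ttl) that makes the metric go stale between sends, so every flush starts from a fresh count:

```sh
# Pause longer than the ttl between sends: the metric is dropped as stale,
# and the next datagram starts a new life with a reset count.
while true; do
  echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125
  sleep 35s   # > ttl (30s default)
done
```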

ritalwar commented 2 months ago

I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above. However, feel free to reopen it if further discussion is needed or if you believe something was missed in the analysis.

TomoZG commented 2 months ago

@immon thanks for the analysis!

In the corner case with period: 1s and ttl: 0s, the same document is generated and indexed in ES every second (with an incrementing @timestamp).

  • We tested this setup with ttl: 0, but as the documentation states, a ttl of zero means metrics will never expire, and this makes the bug even worse, not better.

The workaround I found is to set ttl and period to something smaller than the sleep interval in the while-loop (e.g. 1s).

  • This might be OK for some smaller systems, but our traffic is heavy, and a period of less than 30s generates way too much data in ES. The main problem is not the way documents are created and when; it's their values, which are wrong. That problem persists no matter what ttl and period are set to.

I agree that the count and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot differentiate whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behaviour described above, where a metric is reported every period over the ttl window.

  • I'm trying to grasp why TTL exists in the first place for this use-case. The "original" statsd package linked in the initial bug report resets all values at every flush. Why doesn't this module work the same way? We know when the period expires: flush all collected/calculated metrics at that point and reset all the values for the newly collected metrics. As far as I can see, TTL is obsolete and only brings complexity with no benefit. Removing the TTL parameter also removes the "staleness" problem, and there are no more "update" or "live" metrics.

@ritalwar thank you too for the detailed analysis!

No TTL in StatsD: StatsD does not use TTL. Once metrics are flushed, they are reset and no longer counted for the next interval unless new metrics arrive.

  • Exactly! This Metricbeat statsd module should behave the same, don't you think? There are no "stale" metrics: when the period is reached, all metrics are flushed and all data is reset.

If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.

  • I think this is the main cause of this weird behavior. We shouldn't really track whether the metric is stale or not. Simply: if the period has not passed yet, the metric is relevant, the count should update, and that is it. When the period passes, metrics are flushed and all values are reset. There is no need for TTL.

But If you send different metrics (e.g., processing_item1, processing_item2, processing_item3)

  • The problem is not sending different metrics; we have about a hundred different ones. The problem is the values in the "timer" and "histogram" type metrics, which are not reset on every flush (when the period expires). This makes the graphs plotted for the duration or count of events based on these metrics inaccurate.

To better understand the behavior, adjust the following settings: interval frequency (how often you send metrics), period (how often metrics are fetched from the server), and TTL (how long metrics remain valid). Also, try sending different metrics and observe the count values.

I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above.

  • Well, this looks odd: we demonstrated that the values are wrong and "agreed" that the count and all statistics should reset on every flush. If this can't be fixed, maybe the solution is to update the official docs and remove "Timer" and "Histogram" from the list of supported metric types? This would prevent confusion for potential users.

Thank you for your help!

zlikavac32 commented 2 months ago

I have the same issue as described above regarding timer metrics.

Since no total is stored in ES, it has to be calculated from avg and count. That is not possible, since count is not reset on every flush but is rather a monotonically increasing value until the metric becomes stale. For example, in the output above, mean 3000 × count 10 suggests a total of 30000ms, while the true total for that 10-second window is 3 × 3000 = 9000ms, because the count accumulated across flushes.

When displaying data on a chart, with statsd metrics one could sum the counts to show the total number of events, but with Beats one has to use differences between consecutive documents. That causes a few issues. Once a metric becomes stale and is reset, the next difference is negative. The other issue is that a difference requires two data points to be calculated, so if only one document is recorded within some time frame, the chart displays no value because there is no other value to calculate from. A sketch of this differences approach is shown below.
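For illustration, a sketch of that differences approach in Elasticsearch query DSL (the metricbeat-* index pattern and the 1m interval are assumptions; the derivative pipeline aggregation computes the per-bucket difference and exhibits both problems: negative values after a reset, and no value for the first bucket):

```json
POST metricbeat-*/_search
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "max_count": { "max": { "field": "statsd.processing_item.count" } },
        "count_delta": { "derivative": { "buckets_path": "max_count" } }
      }
    }
  }
}
```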

So my questions are:

ritalwar commented 2 months ago

I've created a backlog issue for us to analyse how we can handle timer and histogram metrics considering ttl, as just removing TTL might not be the best solution; some users might need it to track metrics over longer periods. I'll discuss the priority and take it from there.

zlikavac32 commented 2 months ago

@ritalwar In the meantime, could you please answer my two questions?

I really need this functionality. How can I display the number of received timer metrics and the total time consumed?

For example, if in the first 60 seconds I send two timer metrics:

10ms and 15ms

And in the next 60 seconds I send another two metrics:

20ms and 25ms

How can I display this in Kibana (when, for example, aggregating over a period of 15 minutes) to get the values

4 (counts) and 70ms (total duration)

And what if, after 20 minutes, the same thing happens?

I should then have two points of value 4 and two points of value 70ms on my chart.
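For reference, if the count were reset on every flush (as requested above), the desired numbers would follow directly from the per-document statistics via total = mean × count; a worked calculation under that assumption:

```
first minute:  samples 10ms, 15ms -> count 2, mean 12.5ms, total 2 x 12.5 = 25ms
next minute:   samples 20ms, 25ms -> count 2, mean 22.5ms, total 2 x 22.5 = 45ms
15-minute aggregate: counts 2 + 2 = 4, total duration 25 + 45 = 70ms
```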