elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat – Statsd Module issue with count value when using timer/histogram metric type #39987

Closed abraxas81 closed 2 months ago

abraxas81 commented 5 months ago

Hi, I would like to report an issue with the Metricbeat Statsd module.

When flushing the timer and histogram metric types, the count value just keeps growing instead of being reset; it is only reset when there is some inactivity between flushes.

The problem affects the count value of the timer and histogram metric types.

If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, statsd resets the count value for the timer metric type on every flush.

Configuration

  1. Enable the statsd Metricbeat module
    # statsd.yml
    - module: statsd
      metricsets: ["server"]
      host: "localhost"
      port: "8125"
      enabled: true
      period: 10s
  2. Output to the console
    # metricbeat.yml
    output.console:
      pretty: true
  3. Send some example metrics:
    while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
  4. Example output
{
  "@timestamp": "2024-06-04T13:58:55.047Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "median": 3000,
      "p75": 3000,
      "p95": 3000,
      "p99_9": 3000,
      "stddev": 0,
      "p99": 3000,
      "count": 4,
      "min": 3000,
      "max": 3000,
      "mean": 3000
    }
  },
}
{
  "@timestamp": "2024-06-04T13:59:05.047Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "mean": 3000,
      "p95": 3000,
      "median": 3000,
      "p99": 3000,
      "count": 7, #expected 3
      "max": 3000,
      "p75": 3000,
      "p99_9": 3000,
      "min": 3000,
      "stddev": 0
    }
  },
}
{
  "@timestamp": "2024-06-04T13:59:15.046Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.13.4"
  },
  "statsd": {
    "processing_item": {
      "mean": 3000,
      "p75": 3000,
      "count": 10, #expected 3
      "stddev": 0,
      "p99_9": 3000,
      "median": 3000,
      "p95": 3000,
      "min": 3000,
      "max": 3000,
      "p99": 3000
    }
  }
}
  5. We got counts 4, 7, 10 instead of the expected 4, 3, 3 (a timeline sketch of the expected behaviour follows below).
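For reference, here is a rough timeline of the expected counts, assuming the first datagram is sent at t=0 and a flush happens every 10 seconds:

```
sends:    0s  3s  6s  9s | 12s 15s 18s | 21s 24s 27s | ...
flushes:  t=10s: count 4 | t=20s: count 3 | t=30s: count 3 | ...
```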
immon commented 4 months ago

period is an undocumented setting of the statsd module, but it allows you to control how often the metric is reported (flushed) over its time-to-live window (ttl).

When I sent 1 UDP message to Metricbeat configured with the ES output, I noticed that documents are generated every period seconds (default 10s) over the time window defined by the ttl setting (default 30s). In the corner case with period: 1s and ttl: 0s, the same document is generated and indexed in ES every second (with an incrementing @timestamp).

Regarding the reported issue, it looks like the count is only reset when the metric reaches its time-to-live defined by ttl. The workaround I found is to set ttl and period to something smaller than the sleep interval in the while-loop (e.g. 1s). This allows the metric to reach its ttl and reset the count value. Be careful with a small period, as it can generate additional load on the ES cluster (depending on the number of metrics being monitored). A sketch of the workaround follows.
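A minimal sketch of that workaround, reusing the statsd.yml from the reproduction steps above (the 1s values are illustrative; they only need to be smaller than the 3s send interval):

```yaml
# statsd.yml: workaround sketch, let the metric expire between sends
- module: statsd
  metricsets: ["server"]
  host: "localhost"
  port: "8125"
  enabled: true
  period: 1s   # flush every second
  ttl: 1s      # a metric not updated within 1s goes stale, resetting its count
```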

I agree that the count and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot differentiate whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behaviour described above, where a metric is reported every period over the ttl window.


ritalwar commented 2 months ago

Here are the key findings:

Reproducing and Observing the Behaviour

Detailed Analysis

  1. Metric Sending:

    • Here we are sending a processing_item histogram metric every 3 seconds using this script:
      while true; do echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125; sleep 3s; done
  2. Period (10 seconds):
    • Metrics are collected and aggregated every 10 seconds. Given that metrics arrive every 3 seconds, there will be multiple metrics (3-4) within each 10-second period.
      • First flush: at 10 seconds (may contain 4 metrics)
      • Second flush: at 20 seconds (may contain 3 metrics)
      • And so on, for each 10-second interval.
  3. TTL (default 30 seconds):
    • TTL (time to live) is the maximum duration a metric can stay in the system without being updated before it is considered stale and discarded.
    • With a TTL of 30 seconds, any metric not updated within 30 seconds is considered stale.
    • Since metrics are sent every 3 seconds, each metric is updated before the TTL expires and never goes stale.
    • The default is 30 seconds, but you can configure it in the same file as the period, as sketched below.
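A sketch of setting ttl alongside period in the module configuration (the values shown are the defaults discussed above):

```yaml
# statsd.yml: ttl configured alongside period
- module: statsd
  metricsets: ["server"]
  host: "localhost"
  port: "8125"
  enabled: true
  period: 10s   # how often aggregated metrics are flushed
  ttl: 30s      # how long a metric may go without updates before it is dropped as stale
```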

Metric Updates and Staleness

If you compare this behaviour with GitHub - statsd/statsd: Daemon for easy but powerful stats aggregation, statsd resets the count value for the timer metric type on every flush.

Aggregation and Metrics in StatsD

the count value just keeps growing instead of being reset; it is only reset when there is some inactivity between flushes.

If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.

To better understand the behaviour, adjust the following settings: interval frequency (how often you send metrics), period (how often metrics are fetched from the server), and TTL (how long metrics remain valid). Also, try sending different metrics and observe the count values. One way to observe the reset is sketched below.
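For example, a minimal sketch (assuming the default ttl of 30 seconds; the 35s pause is illustrative, chosen only to exceed the ttl) that makes the metric go stale between sends, so every flush starts from a fresh count:

```sh
# Pause longer than the ttl between sends: the metric is dropped as stale,
# and the next datagram starts a new life with a reset count.
while true; do
  echo "processing_item:3000|h" > /dev/udp/127.0.0.1/8125
  sleep 35s   # > ttl (30s default)
done
```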

ritalwar commented 2 months ago

I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above. However, feel free to reopen it if further discussion is needed or if you believe something was missed in the analysis.

TomoZG commented 2 months ago

@immon thanks for the analysis!

In the corner case with period: 1s and ttl: 0s, the same document is generated and indexed in ES every second (with an incrementing @timestamp).

  • We tested this setup with ttl: 0, but as the documentation states, a ttl of zero means metrics will never expire, and this makes the bug even worse, not better.

The workaround I found is to set ttl and period to something smaller than the sleep interval in the while-loop (e.g. 1s).

  • This might be OK for some smaller systems, but our traffic is heavy, and a period of less than 30s generates way too much data in ES. The main problem is not the way documents are created and when; it's their values, which are wrong. That problem persists no matter what ttl and period are set to.

I agree that the count and all statistics should reset on every flush. Currently it's impossible to make sense of the values, as you cannot differentiate whether a newer document is an "update" to the statistics reported previously, a new "life" of the metric, or a duplicate caused by the behaviour described above, where a metric is reported every period over the ttl window.

  • I'm trying to grasp why TTL exists in the first place for this use-case. The "original" statsd package linked in the initial bug report resets all values at every flush. Why doesn't this module work the same way? We know when the period expires: flush all collected/calculated metrics at that point and reset all the values for the newly collected metrics. As far as I can see, TTL is obsolete and only brings complexity with no benefit. Removing the TTL parameter also removes the "staleness" problem, and there are no more "update" or "live" metrics.

@ritalwar thank you too for the detailed analysis!

No TTL in StatsD: StatsD does not use TTL. Once metrics are flushed, they are reset and no longer counted for the next interval unless new metrics arrive.

  • Exactly! This Metricbeat statsd module should behave the same, don't you think? There are no "stale" metrics: when the period is reached, all metrics are flushed and all data is reset.

If the same metric keeps getting updated, it won't be considered stale, so the count isn't reset. This is because the metric is deemed still relevant.

  • I think this is the main cause of this weird behavior. We shouldn't really track whether the metric is stale or not. Simply: if the period has not passed yet, the metric is relevant, the count should update, and that is it. When the period passes, metrics are flushed and all values are reset. There is no need for TTL.

But If you send different metrics (e.g., processing_item1, processing_item2, processing_item3)

  • The problem is not sending different metrics; we have about a hundred different ones. The problem is the values in the "timer" and "histogram" type metrics, which are not reset on every flush (when the period expires). This makes the graphs plotted for the duration or count of events based on these metrics inaccurate.

To better understand the behavior, adjust the following settings: interval frequency (how often you send metrics), period (how often metrics are fetched from the server), and TTL (how long metrics remain valid). Also, try sending different metrics and observe the count values.

I am closing this issue as it’s not a bug but rather the expected behavior based on how data is reported when using the period and TTL configurations, as explained in detail above.

  • Well, this looks odd: we demonstrated that the values are wrong and "agreed" that the count and all statistics should reset on every flush. If this can't be fixed, maybe the solution is to update the official docs and remove "Timer" and "Histogram" from the list of supported metric types? This would prevent confusion for potential users.

Thank you for your help!

zlikavac32 commented 2 months ago

I have the same issue as described above regarding timer metrics.

Since no total is stored in ES, it has to be calculated from avg and count. That is not possible, since count is not reset on every flush but is rather a monotonically increasing value until the metric becomes stale. For example, in the output above, mean 3000 × count 10 suggests a total of 30000ms, while the true total for that 10-second window is 3 × 3000 = 9000ms, because the count accumulated across flushes.

When displaying data on a chart, with statsd metrics one could sum the counts to show the total number of events, but with Beats one has to use differences between consecutive documents. That causes a few issues. Once a metric becomes stale and is reset, the next difference is negative. The other issue is that a difference requires two data points to be calculated, so if only one document is recorded within some time frame, the chart displays no value because there is no other value to calculate from. A sketch of this differences approach is shown below.
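For illustration, a sketch of that differences approach in Elasticsearch query DSL (the metricbeat-* index pattern and the 1m interval are assumptions; the derivative pipeline aggregation computes the per-bucket difference and exhibits both problems: negative values after a reset, and no value for the first bucket):

```json
POST metricbeat-*/_search
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "max_count": { "max": { "field": "statsd.processing_item.count" } },
        "count_delta": { "derivative": { "buckets_path": "max_count" } }
      }
    }
  }
}
```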

So my questions are:

ritalwar commented 2 months ago

I've created a backlog issue for us to analyse how we can handle timer and histogram metrics considering ttl, as just removing TTL might not be the best solution; some users might need it to track metrics over longer periods. I'll discuss the priority and take it from there.

zlikavac32 commented 2 months ago

@ritalwar In the meantime, could you please answer my two questions?

I really need this functionality. How can I display the number of received timer metrics and the total time consumed?

For example, if in the first 60 seconds I send two timer metrics:

10ms and 15ms

And in the next 60 seconds I send another two metrics:

20ms and 25ms

How can I display this in Kibana (when, for example, aggregating over a period of 15 minutes) to get the values

4 (counts) and 70ms (total duration)

And what if, after 20 minutes, the same thing happens?

I should then have two points of value 4 and two points of value 70ms on my chart.
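For reference, if the count were reset on every flush (as requested above), the desired numbers would follow directly from the per-document statistics via total = mean × count; a worked calculation under that assumption:

```
first minute:  samples 10ms, 15ms -> count 2, mean 12.5ms, total 2 x 12.5 = 25ms
next minute:   samples 20ms, 25ms -> count 2, mean 22.5ms, total 2 x 22.5 = 45ms
15-minute aggregate: counts 2 + 2 = 4, total duration 25 + 45 = 70ms
```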