elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Network metric calculations for the Host UI might be showing wrong values #184099

Closed: lahsivjar closed this issue 2 months ago

lahsivjar commented 4 months ago

Kibana version:

Elasticsearch version: N/A

Server OS version: N/A

Browser version: N/A

Browser OS version: N/A

Original install method (e.g. download page, yum, from source, etc.): N/A

Describe the bug:

The network metrics calculate bytes per second using the metricset.period field, but also seem to apply per-second normalization on top:

[Screenshot: the metric's formula configuration]

In my understanding, this calculates a per-second value of an already per-second metric based on the size of the bucket on the x-axis. Is this the intention of the metric, or am I missing something here?

As per my understanding, we should only be using normalize-by-seconds (without the special calculation involving metricset.period), since the UI seems to be doing its own grouping on the x-axis, and calculating the data based on the UI's grouping should give the correct result. I have attached 2 snapshots of the above metric, which has metricset.period set to 10. The first picture shows 3 metrics:

  1. normalized/sec
  2. normalized by using metricset.period
  3. both 1 and 2 applied (what we are doing now; see the formula sketch after this list)
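
Roughly, as Lens formulas (a sketch reconstructed from the formulas quoted later in this thread; the kql guard on max(metricset.period) is omitted for brevity):

  1. average(host.network.ingress.bytes) * 8
     with Normalize by unit > per second enabled
  2. average(host.network.ingress.bytes) * 8 / (max(metricset.period) / 1000)
     with no Lens normalization
  3. average(host.network.ingress.bytes) * 8 / (max(metricset.period) / 1000)
     with Normalize by unit > per second also enabled (the double normalization)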

The first pic shows 1 and 2 reporting the same values since (from what I understand) the x-axis time bucketing is the same as metricset.period, which is 10s. The current metrics seem incorrect to me, but I'm not sure whether I'm missing something obvious or the intention of the metrics is different.

[Two screenshots of the metric variants described above]

Steps to reproduce: N/A

Expected behavior: The metric is correctly plotted

Screenshots (if relevant): Attached in the description section with context.

Errors in browser console (if relevant): N/A

Provide logs and/or server output (if relevant): N/A

Any additional context:

elasticmachine commented 4 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

felixbarny commented 4 months ago

To me, it also looks like we're doing double normalization here. I also don't understand why we're taking the average value of the gauge instead of summing up all of the values in the bucket and then just using the Normalize by unit > per second option in Lens to normalize to per-second values.
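
That suggestion would boil down to something like this as a Lens formula (a sketch; the * 8 bytes-to-bits conversion used in the formulas later in this thread is omitted):

  sum(host.network.ingress.bytes)

with Normalize by unit > per second enabled, instead of any metricset.period math.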

roshan-elastic commented 4 months ago

Thanks @felixbarny - I'll pick this up on Friday to figure out prioritisation. I think it's just a Lens tweak, based on a likely fix I was shown.

felixbarny commented 3 months ago

Instead of average(host.network.ingress.bytes), I think this also needs to use a sum, so that it sums up all the deltas that fall into a date_histogram bucket. For example, when the visualization shows a value for each 1m and the shipper sends delta metrics every 10s, we need to sum up the deltas instead of taking the average value.
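
A quick numeric sketch of that argument, assuming one 10-byte delta document per 10s reporting interval:

  1m bucket = 6 docs x 10 bytes
  sum: 60 bytes -> per-second normalization: 60 / 60s = 1 byte/s (correct)
  avg: 10 bytes -> per-second normalization: 10 / 60s ~= 0.17 byte/s (6x too low)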

crespocarlos commented 2 months ago

The network charts are wrong; we shouldn't double-normalize the value. The analysis below assumes that what the hosts table shows IS correct.

ES aggregation executed by the query:

{
  // average delta per document in the bucket
  rx_avg: {
    avg: {
      field: 'host.network.ingress.bytes',
    },
  },
  // reporting period of the documents that actually carry the field
  rx_period: {
    filter: {
      exists: {
        field: 'host.network.ingress.bytes',
      },
    },
    aggs: {
      period: {
        max: {
          field: 'metricset.period',
        },
      },
    },
  },
  // bytes/s = average delta / (metricset.period in ms / 1000)
  rx: {
    bucket_script: {
      buckets_path: {
        value: 'rx_avg',
        period: 'rx_period>period',
      },
      script: {
        source: 'params.value / (params.period / 1000)',
        lang: 'painless',
      },
      gap_policy: 'skip',
    },
  },
}

RX value for 15m worth of data: 200.6 kbit/s


Replicating table value in the Metric chart

Replicating table value in the Line chart

Both sum(host.network.ingress.bytes) * 8 and average(host.network.ingress.bytes) * 8 / (max(metricset.period, kql='host.network.ingress.bytes: *') / 1000) (without double normalization) return the same value. However, the latter is more accurate with shorter time ranges.
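
That they agree is expected when sampling is uniform: with one document per reporting period p (seconds) and N documents covering the charted range T = N * p, normalizing the sum over the range is algebraically the same as dividing the average by the period:

  sum / T = (N * avg) / (N * p) = avg / p

With short time ranges, T and N * p can drift apart (partial periods at the edges of the range), which would explain why the period-based variant tracks the table more closely there.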

RX value with 15m range: 125.3kbit/s

Table: [screenshot]

Chart: [screenshot]

Calculating the average from the data points in the chart:

RX value with 5m range: 134.5kbit/s

Table: [screenshot]

Chart: [screenshot]

Calculating the average from the data points in the chart:

average(host.network.ingress.bytes) * 8 / (max(metricset.period, kql='host.network.ingress.bytes: *') / 1000) without double normalization is more accurate because it takes into account metricset.period, which is consistent with what the table does. If relying on metricset.period is wrong, we'd have to change the aggregation too.

cc: @felixbarny @lahsivjar @roshan-elastic

felixbarny commented 2 months ago

I don't think we can assume that what the table shows is the correct value and that we need to replicate in the visualization what the table is doing. I think a better approach to arriving at a correct visualization is to think about what we want to visualize from scratch and to manually validate it by looking at the individual data points from a set of sample documents.

Any formula that uses the average instead of the sum will only work in the special case where the reporting interval (~metricset.period) happens to be the same as the step interval used in the UI. For example, when host.network.ingress.bytes increments by 10 every reporting interval (10s), and the visualization also shows a datapoint every 10s, it doesn't matter whether you use a sum or avg, as there's always exactly one data point in each bucket. However, if you're looking at a larger time range, and the graph shows a datapoint for every minute, we need to sum up all the values (=60) instead of taking the average (=10) to get the total increase of the counter for each minute. We can then let Lens normalize the value per second (60 bytes / 60 s = 1 byte/s).

I don't think we should be using the metricset period to do the normalization; we should rely on the Lens normalization feature instead. Otherwise, the results are incorrect if the reporting interval changes. Also, metricset.period is not available for OTel or custom metrics, so this method is not generally applicable, and it makes the formula a lot more complex.

crespocarlos commented 2 months ago

> Any formula that uses the average instead of the sum will only work in the special case where the reporting interval (~metricset.period) happens to be the same as the step interval used in the UI. For example, when host.network.ingress.bytes increments by 10 every reporting interval (10s), and the visualization also shows a datapoint every 10s, it doesn't matter whether you use a sum or avg, as there's always exactly one data point in each bucket. However, if you're looking at a larger time range, and the graph shows a datapoint for every minute, we need to sum up all the values (=60) instead of taking the average (=10) to get the total increase of the counter for each minute. We can then let Lens normalize the value per second (60 bytes / 60 s = 1 byte/s).

@felixbarny I see. That is true when there is a date histogram. When there isn't, the average will return the correct value for the table; otherwise, the result would be the sum of all the data within the time range. We don't necessarily need to normalize it by metricset.period.

Perhaps I'm getting this wrong, but I don't think the average on the table and what the line chart shows will match - they currently don't, as my comment above shows. I was trying to see if they could.

felixbarny commented 2 months ago

This is also true when there's no date histogram. It's like having one big bucket in the date histogram, where we can also just sum up all the deltas and then normalize to a per-second value by dividing by the time range. Maybe there's some nuance around cases where only parts of the selected time range contain values.

This way, we can have consistency across what the table, the gauge, and the graph show.
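
Sketched as a formula (the aggregation in the next comment follows this shape), the whole-range rate would be:

  rate (bytes/s) = sum(host.network.ingress.bytes) / ((max(@timestamp) - min(@timestamp)) / 1000)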

crespocarlos commented 2 months ago

@felixbarny, got it. Doing as you said resulted in something that makes more sense.

The following aggregation works:

{
  // total of all the deltas that fall into the bucket
  rx_sum: {
    sum: {
      field: 'host.network.ingress.bytes',
    },
  },
  // observed time span of the data
  min_timestamp: {
    min: {
      field: '@timestamp',
    },
  },
  max_timestamp: {
    max: {
      field: '@timestamp',
    },
  },
  // bytes/s = sum of deltas / time span in seconds
  rx: {
    bucket_script: {
      buckets_path: {
        value: 'rx_sum',
        minTime: 'min_timestamp',
        maxTime: 'max_timestamp',
      },
      script: {
        source: 'params.value / ((params.maxTime - params.minTime) / 1000)',
        lang: 'painless',
      },
      gap_policy: 'skip',
    },
  },
}

Metric chart: [screenshot]

I'm not entirely sure why the value here doesn't match. There seems to be a rounding problem.

Line chart: [screenshot]

It's not normalized because doing so resulted in a rate per second that did not match the above results.

The above line chart has the following data points:

1.1Mbit
1.3Mbit
2.2Mbit
515.9Mbit
18.2Mbit
1.5Mbit
14.0Mbit
13.2Mbit
15.3Mbit
14.4Mbit
2.8Mbit
1.8Mbit
2.2Mbit
1.6Mbit
1.6Mbit
802.8kbit
363.5kbit
536.7Mbit
34.7Mbit
1.8Mbit
15.3Mbit
6.8Mbit
1.7Mbit
4.8Mbit
9.5Mbit
1.4Mbit
2.2Mbit
2.0Mbit
1.4Mbit
1.4Mbit

Sum: 1221.7283 Mbit
Duration in seconds: 300
Rate: 1221.7283 / 300.11 = 4.07 Mbit/s

felixbarny commented 2 months ago

> It's not normalized because doing so resulted in a rate per second that did not match the above results.
>
> Rate: 1221.7283 / 300.11 = 4.07 Mbit/s

4.07 Mbit/s ~= 4.1 Mbit/s

Considering every data point may be rounded, it looks like this is pretty well aligned with the gauge visualization.

crespocarlos commented 2 months ago

@roshan-elastic the proposal here would be to change the RX and TX formulas to perform a sum of the deltas to calculate the rate per second.

This will impact the corresponding inventory alerts, and we'll have to replicate the same solution for the CPU usage metric.

crespocarlos commented 2 months ago

This issue will be addressed here https://github.com/elastic/kibana/issues/188641