envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

envoyMetricsService Istio Histograms are Missing +Inf Upper Bound #12387

Open PatrickDuncan opened 4 years ago

PatrickDuncan commented 4 years ago

Title: envoyMetricsService Istio Histograms are Missing +Inf Upper Bound

Description: This is a problem with istio/proxyv2

With a normal sidecar/Kiali/Prometheus setup the Istio histogram metrics (ex. istio_request_duration_milliseconds_bucket) include the +Inf bucket.

With the metrics received via the gRPC envoyMetricsService, the +Inf bucket is missing. This is a major issue for the histogram_quantile Prometheus function.

The highest bucket must have an upper bound of +Inf; otherwise histogram_quantile returns NaN.

For now I'll just convert the highest upper bound to +Inf in my metrics service.

Relevant Links: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#RemoteService

Expected behavior

It should include the +Inf bucket. Here's what I receive right now:

  [histogram:<sample_count:3 sample_sum:2565 bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket: bucket:<cumulative_count:3 upper_bound:1.8e+06 > bucket:<cumulative_count:3 upper_bound:3.6e+06 > > timestamp_ms:1595617404646 ]

Steps to reproduce the bug

Run a gRPC envoy metrics service with an Istio sidecar and print the metrics received.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

client version: 1.6.4
control plane version: 1.6.4
data plane version: 1.6.4 (12 proxies)

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:03Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.9+IKS", GitCommit:"dbe4e8a5c169e0a6a8a9b7b183561fa61e161985", GitTreeState:"clean", BuildDate:"2020-07-16T01:30:57Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
mattklein123 commented 4 years ago

cc @ggreenway who I think is our Prometheus expert at this point.

ggreenway commented 4 years ago

This is actually a completely different code path from the admin Prometheus stats handler. I've never looked at this before, but I think it should be a simple fix. In MetricsServiceSink::flushHistogram(), we have this:

  // Add bucket information for histograms.
  io::prometheus::client::MetricFamily* histogram_metrics_family = message_.add_envoy_metrics();
  histogram_metrics_family->set_type(io::prometheus::client::MetricType::HISTOGRAM);
  histogram_metrics_family->set_name(envoy_histogram.name());
  auto* histogram_metric = histogram_metrics_family->add_metric();
  histogram_metric->set_timestamp_ms(std::chrono::duration_cast<std::chrono::milliseconds>(
                                         time_source_.systemTime().time_since_epoch())
                                         .count());
  auto* histogram = histogram_metric->mutable_histogram();
  histogram->set_sample_count(hist_stats.sampleCount());
  histogram->set_sample_sum(hist_stats.sampleSum());
  for (size_t i = 0; i < hist_stats.supportedBuckets().size(); i++) {
    auto* bucket = histogram->add_bucket();
    bucket->set_upper_bound(hist_stats.supportedBuckets()[i]);
    bucket->set_cumulative_count(hist_stats.computedBuckets()[i]);
  }

The +Inf bucket is implied by the Prometheus format and isn't included in the configured buckets. So after the loop, add another bucket with an upper bound of +Inf and a count of hist_stats.sampleCount().

PatrickDuncan commented 4 years ago

@ggreenway @mattklein123 Thank you. I should be able to make this change at some point.