Stackdriver / stackdriver-prometheus

Prometheus support for Stackdriver
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Error when reporting quantiles #18

Open michael-barker opened 6 years ago

michael-barker commented 6 years ago

What did you do?

Reporting quantiles to Prometheus using a Spring Boot application with Actuator and the Micrometer Prometheus Registry.

What did you expect to see?

I can view the quantiles in Grafana. I would expect to be able to view the quantiles in Stackdriver as well.

What did you see instead? Under which circumstances?

I get the following error from stackdriver-prometheus.

Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Metric kind for metric external.googleapis.com/prometheus/http_server_requests_seconds must be CUMULATIVE, but is GAUGE.: timeSeries[54,57,60]

These are the relevant metrics scraped by Prometheus.

http_server_requests_seconds{exception="None",method="GET",status="200",uri="/actuator/prometheus",quantile="0.99",} 0.0262144
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 2.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 0.534899535
http_server_requests_seconds{exception="None",method="GET",status="200",uri="/",quantile="0.99",} 0.595591168
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/",} 85.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/",} 25.994628965
http_server_requests_seconds{exception="RuntimeException",method="GET",status="500",uri="/",quantile="0.99",} 0.411041792
http_server_requests_seconds_count{exception="RuntimeException",method="GET",status="500",uri="/",} 66.0
http_server_requests_seconds_sum{exception="RuntimeException",method="GET",status="500",uri="/",} 20.575536392

http_server_requests_seconds_count, http_server_requests_seconds_sum, and http_server_requests_seconds_max all show up in Stackdriver but I don't see http_server_requests_seconds.

Environment

JVM application built with the default Jib base image.

Default.

Default.

No modifications from the default other than setting project ID, cluster name, and cluster location.

Unmodified.

jkohen commented 5 years ago

Thanks for the report. Did you modify the dump in any way? It looks like it's missing the metadata which notifies the Stackdriver translator that this is cumulative, and that could explain it being sent as a GAUGE.

I'd expect the output of your Prometheus endpoint to look as follows:

# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds summary
http_server_requests_seconds{exception="None",method="GET",status="200",uri="/actuator/prometheus",quantile="0.99",} 0.0262144
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 2.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 0.034334051
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 0.032477774
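A quick way to check whether a scrape dump still carries its type metadata is to look for `# TYPE` lines covering each sample. The following is only a minimal sketch: the helper names `declared_types` and `samples_without_type` are hypothetical, and the parsing is a simplification of the Prometheus text format (it assumes the `_count`, `_sum`, and `_bucket` series inherit the type declared for their base name).

```python
import re

# "# TYPE <metric_name> <type>" metadata lines in the text exposition format.
TYPE_LINE = re.compile(r"^#\s*TYPE\s+(\S+)\s+(\S+)\s*$")
# Leading metric name of a sample line (everything before "{" or whitespace).
SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def declared_types(exposition):
    """Return {metric_name: type} for every # TYPE line found."""
    types = {}
    for line in exposition.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            types[match.group(1)] = match.group(2)
    return types


def samples_without_type(exposition):
    """Return the set of sample names that no # TYPE line covers."""
    types = declared_types(exposition)
    missing = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_NAME.match(line)
        if not match:
            continue
        name = match.group(1)
        base = name
        # Assumption: suffixed series share the TYPE of their base metric.
        for suffix in ("_count", "_sum", "_bucket"):
            if name.endswith(suffix) and name[: -len(suffix)] in types:
                base = name[: -len(suffix)]
                break
        if base not in types:
            missing.add(name)
    return missing
```

Running `samples_without_type` over the dump from the original report would flag every series, since that dump contained no `# TYPE` lines at all.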

Did this help you solve the issue?

michael-barker commented 5 years ago

Sorry, I didn't realize I left some things off. Here's a more complete example from the scrape endpoint.

# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds summary
http_server_requests_seconds{exception="None",method="GET",status="200",uri="/actuator/prometheus",quantile="0.99",} 0.046137344
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 3.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 1.077069697
http_server_requests_seconds{exception="None",method="GET",status="200",uri="/",quantile="0.99",} 0.0
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/",} 0.379984312
# HELP http_server_requests_seconds_max  
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/actuator/prometheus",} 0.978263554
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/",} 0.379984312

jkohen commented 5 years ago

Michael, thank you for submitting the full response. I re-read your report and looked at the relevant code and now I think I know what happened.

To make it work, you need to delete the metric descriptor for external.googleapis.com/prometheus/http_server_requests_seconds via an API call, which you can make at https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors/delete. Note that this operation will delete any data previously in this metric, but it won't affect other metrics.
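For anyone scripting this instead of using the API explorer page, the request can be sketched as below. This is an illustration only: it builds the `DELETE` request for the `projects.metricDescriptors.delete` REST method without sending it. The helper names, the `my-project` project ID, and the way the access token is obtained are assumptions, not something taken from this thread.

```python
from urllib.parse import quote
from urllib.request import Request

# Base URL of the Cloud Monitoring (Stackdriver) v3 REST API.
API_ROOT = "https://monitoring.googleapis.com/v3"


def descriptor_name(project_id, metric_type):
    """Resource name of a metric descriptor, as used in the REST path."""
    return f"projects/{project_id}/metricDescriptors/{metric_type}"


def build_delete_request(project_id, metric_type, access_token):
    """Build (but do not send) the DELETE request for one metric descriptor.

    access_token is assumed to be a valid OAuth 2.0 token with permission
    to write monitoring data; obtaining it is outside this sketch.
    """
    # quote() keeps "/" by default, so the metric type stays a path suffix.
    url = f"{API_ROOT}/{quote(descriptor_name(project_id, metric_type))}"
    return Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Hypothetical usage; passing the request to urllib.request.urlopen() would
# send it and delete the descriptor together with its stored data.
request = build_delete_request(
    "my-project",
    "external.googleapis.com/prometheus/http_server_requests_seconds",
    "ACCESS_TOKEN_PLACEHOLDER",
)
```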

Each quantile for a Prometheus summary metric is written to Stackdriver as an individual GAUGE. It's possible that something else in your system previously wrote metric external.googleapis.com/prometheus/http_server_requests_seconds as a CUMULATIVE. Because Stackdriver treats a change of metric kind as incompatible (it would lose the existing data), it rejects the write.
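The behavior described here can be modeled with a toy registry. This is not Stackdriver's code, just a sketch of the rule as explained in this comment: the first write fixes the kind of a metric, later writes with a different kind are rejected, and deleting the descriptor clears the way. The class and exception names are hypothetical.

```python
class KindMismatch(Exception):
    """Raised when a write uses a different kind than the stored descriptor."""


class DescriptorRegistry:
    """Toy model: one fixed metric kind per metric type."""

    def __init__(self):
        self._kinds = {}

    def write(self, metric_type, kind):
        # The first write creates the descriptor and fixes its kind.
        existing = self._kinds.setdefault(metric_type, kind)
        if existing != kind:
            raise KindMismatch(
                f"Metric kind for metric {metric_type} "
                f"must be {existing}, but is {kind}."
            )

    def delete(self, metric_type):
        # Deleting the descriptor lets the next write pick a new kind.
        self._kinds.pop(metric_type, None)


registry = DescriptorRegistry()
metric = "external.googleapis.com/prometheus/http_server_requests_seconds"
registry.write(metric, "CUMULATIVE")  # some earlier writer
try:
    registry.write(metric, "GAUGE")  # the quantile series is rejected
except KindMismatch as err:
    print(err)
registry.delete(metric)
registry.write(metric, "GAUGE")  # accepted once the old descriptor is gone
```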

Thank you for your patience. Let me know whether it helped. I would be particularly interested to hear if you are still running a different Prometheus client that insists on exporting http_server_requests_seconds as a CUMULATIVE. Prometheus tolerates that, but it wouldn't work particularly well in aggregations.

michael-barker commented 5 years ago

@jkohen Thanks for the detailed explanation! That did fix the issue. Does that mean that all applications reporting the http_server_requests_seconds metric have to report the same type, either a histogram or a summary? There couldn't be one app reporting a quantile summary and another reporting a histogram? Sounds like this might be a limitation of Stackdriver.

jkohen commented 5 years ago

@fabxc what are your thoughts on this? @michael-barker you are correct. I wonder how often it is the case that a metric has different types. Queries that aggregate both wouldn't make sense. We can find ways around this, but the behavior will be harder to explain to users.