cloudposse / prometheus-to-cloudwatch

Utility for scraping Prometheus metrics from a Prometheus client endpoint and publishing them to CloudWatch
https://cloudposse.com/accelerate
Apache License 2.0

Sanitise values before sending to CloudWatch #17

Closed: sergei-ivanov closed this 5 years ago

sergei-ivanov commented 5 years ago

Discard data points with known invalid values so that the CloudWatch API does not reject the whole batch.

sergei-ivanov commented 5 years ago

We were seeing a number of errors like these in the logs:

2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value � for parameter MetricData.member.9.Value is invalid.
    status code: 400, request id: e687e984-95fa-11e9-a0f4-4f92e001f8b0
2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value � for parameter MetricData.member.10.Value is invalid.
    status code: 400, request id: e6a7cd85-95fa-11e9-a0f4-4f92e001f8b0
2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value � for parameter MetricData.member.10.Value is invalid.
    status code: 400, request id: e6cc1dc3-95fa-11e9-8838-1b8ac578faec
2019/06/23 21:07:26 prometheus-to-cloudwatch: published 89 metrics to CloudWatch

It turned out they were related to some JVM metrics that returned a NaN value, e.g.:

# HELP tomcat_threads_busy_threads  
# TYPE tomcat_threads_busy_threads gauge
tomcat_threads_busy_threads{name="http-nio-9982",} NaN

Since CloudWatch is unable to process these values, there is not much point in sending them in the first place. So this PR filters out all data points with values that would be invalid from CloudWatch's point of view.

With the fix applied, the errors go away and the 3 rogue data points are dropped, which brings the total number of published metrics down from 89 to 86:

2019/06/23 21:04:47 prometheus-to-cloudwatch: published 86 metrics to CloudWatch