We were seeing a number of errors like these in the logs:
2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value NaN for parameter MetricData.member.9.Value is invalid.
status code: 400, request id: e687e984-95fa-11e9-a0f4-4f92e001f8b0
2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value NaN for parameter MetricData.member.10.Value is invalid.
status code: 400, request id: e6a7cd85-95fa-11e9-a0f4-4f92e001f8b0
2019/06/23 21:07:25 prometheus-to-cloudwatch: error publishing to CloudWatch: InvalidParameterValue: The value NaN for parameter MetricData.member.10.Value is invalid.
status code: 400, request id: e6cc1dc3-95fa-11e9-8838-1b8ac578faec
2019/06/23 21:07:26 prometheus-to-cloudwatch: published 89 metrics to CloudWatch
It turned out they were related to some JVM metrics that returned a NaN value, e.g.:
# HELP tomcat_threads_busy_threads
# TYPE tomcat_threads_busy_threads gauge
tomcat_threads_busy_threads{name="http-nio-9982",} NaN
Since CloudWatch is unable to process these values, there is not much point in sending them in the first place. This PR therefore filters out all data points whose values would be invalid from CloudWatch's point of view.
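The core of the fix is just a guard in front of the PutMetricData call. Below is a minimal sketch of the idea in Go against the v1 AWS SDK; the function names are illustrative, not the actual diff, and the 2^360 magnitude limit is taken from the PutMetricData documentation:

package main

import (
	"log"
	"math"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// isAcceptedByCloudWatch reports whether PutMetricData can store the value.
// CloudWatch rejects NaN and +/-Inf outright, and documents a magnitude
// limit of 2^360 for finite values.
func isAcceptedByCloudWatch(v float64) bool {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return false
	}
	return math.Abs(v) <= math.Pow(2, 360)
}

// dropInvalidDatums removes data points with values CloudWatch would reject,
// so that one NaN gauge cannot fail an entire PutMetricData request.
func dropInvalidDatums(data []*cloudwatch.MetricDatum) []*cloudwatch.MetricDatum {
	kept := data[:0]
	for _, d := range data {
		if d.Value != nil && !isAcceptedByCloudWatch(*d.Value) {
			log.Printf("dropping %s: unsupported value %v",
				aws.StringValue(d.MetricName), *d.Value)
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	data := []*cloudwatch.MetricDatum{
		{MetricName: aws.String("tomcat_threads_busy_threads"), Value: aws.Float64(math.NaN())},
		{MetricName: aws.String("jvm_threads_current"), Value: aws.Float64(42)},
	}
	kept := dropInvalidDatums(data)
	log.Printf("kept %d of %d data points", len(kept), len(data)) // kept 1 of 2
}

Filtering on the client side means the remaining valid data points in a batch still get published, instead of the whole request failing with a 400 response.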
With the fix applied, the errors go away and the 3 rogue data points are dropped, bringing the total number of published metrics down from 89 to 86:
2019/06/23 21:04:47 prometheus-to-cloudwatch: published 86 metrics to CloudWatch
Throw away data points with known-invalid values to avoid the whole batch being rejected by the CloudWatch API.