aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows server.
MIT License

collectd GAUGE NaN converted into a 0 #570

Closed edlins closed 11 months ago

edlins commented 2 years ago

Describe the bug

When the collectd network output plugin reports "nan" for a GAUGE value, CWA adds a 0-value metric data point.

Steps to reproduce

I'm using the collectd tail input plugin to scrape for keywords. If the keyword does not appear in a 60s interval, the network output plugin reports a nan. If necessary, I can devise a degenerate test case for you.

What did you expect to see?

Drop it; do not report a metric value for that interval.

What did you see instead?

The CloudWatch metric graph clearly shows a zero when collectd reported a nan (not a zero). This zero is fictitious.

What version did you use?

/opt/aws/amazon-cloudwatch-agent$ cat bin/CWAGENT_VERSION
1.247354.0b251981

What config did you use?

/opt/aws/amazon-cloudwatch-agent$ cat bin/config.json 
{
  "agent": {
    "run_as_user": XXX,
    "region": YYY,
    "debug": true
  },
  "metrics": {
    "metrics_collected": {
      "collectd": {
        "collectd_security_level": "none"
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ],
        "drop_device": true
      }
    }
  }
}

Environment

/opt/aws/amazon-cloudwatch-agent$ cat /etc/issue
Ubuntu 22.04 LTS \n \l

Additional context

I know that NaN is documented as not supported in CloudWatch:

special values (for example, NaN, +Infinity, -Infinity) are not supported

https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html

However, NaN is Not a Number, which is most certainly != 0. I see that other plugins (prometheus_scraper, for example) properly drop NaN.

~/Projects/amazon-cloudwatch-agent$ find . -type f -exec grep -H NaN {} \;
./plugins/inputs/prometheus_scraper/calculator.go:      log.Printf("D! Drop metric with NaN or Inf value: %v", pm)

Why does collectd data not get the same treatment?

edlins commented 2 years ago

See for yourself in 60 seconds.

collectd_cwa_Dockerfile.txt

Different issue - amazon-cloudwatch-agent-ctl does not work out of the box in docker. It needs

sed -ir 's/\\-\\.mount/tmp.mount/' /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl

edlins commented 2 years ago

FYI the top of the dockerfile attached above has its own instructions as:

# syntax = docker/dockerfile:1.4
#
# docker build:
#   DOCKER_BUILDKIT=1 docker build . -f collectd_cwa_Dockerfile.txt -t collectd
#
# docker run: 
#   docker run -e REGION=[region] -e SECRET_ACCESS_KEY=[secret access key] -e ACCESS_KEY_ID=[access key id] -d --rm --name collectd collectd
#
# get the hostname, to find in CloudWatch -> Metrics -> All Metrics -> Browse -> Search, as "CWAgent > host, instance, type, type_instance"
#   docker exec collectd hostname
#
# get collectd csv data (after running for > 60s):
#   docker exec collectd find /var/lib/collectd -type f -exec cat {} \;
#
# get the output of the collectd network plugin (ctrl-c to quit):
#   docker exec collectd tcpdump -i lo -n udp port 25826 -X
#
# get the CWA debug log (ctrl-c to quit):
#   docker exec collectd tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

This should be enough to get the zeros uploading to CloudWatch using whatever credentials you prefer. Of course, the IAM permissions have to allow the relevant metric actions.

I've attempted to simplify the recreation of this issue as much as possible. Please let me know if there's anything else I can provide. If this can't or won't be fixed in the agent, I'll have to re-engineer the metrics I collect because it is critical for me to distinguish NaN (something did not happen) versus zero (something happened and experienced zero errors).

SaxyPandaBear commented 2 years ago

The main complication around NaN values is that the backend doesn't support it. https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html

Although the Value parameter accepts numbers of type Double, CloudWatch rejects values that are either too small or too large. Values must be in the range of -2^360 to 2^360. In addition, special values (for example, NaN, +Infinity, -Infinity) are not supported.

edlins commented 2 years ago

I understand NaN values are not supported in CloudWatch, so they should be dropped by the agent with no value uploaded. That way, alarms will treat this as missing data. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data

Sometimes, not every expected data point for a metric gets reported to CloudWatch. For example, this can happen when a connection is lost, a server goes down, or when a metric reports data only intermittently by design.

In my case "a metric reports data only intermittently by design." Currently, the NaN values are converted to zero and uploaded. NaN != 0 by definition. NaN == x is always false. https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN

I used to use the collectd-cloudwatch plugin for collectd at https://github.com/awslabs/collectd-cloudwatch which worked correctly by dropping NaN. But that project has gone stale and was never remediated for python3. So I switched to the CWA and now my alarms are evergreen because of this NaN -> 0 "translation feature".

edlins commented 2 years ago

I took another look and it seems that for float64 this is already handled.

https://github.com/aws/amazon-cloudwatch-agent/blob/53040cdc24bfe683175fcec6d560b9e34136154b/internal/models/awscsm_pipeline.go#L90-L97

github-actions[bot] commented 1 year ago

This issue was marked stale due to lack of activity.

edlins commented 1 year ago

Well, sounds like WONTFIX. Caveat emptor: basic math fail, unusable. Maintainer won't even tag this as a BUG. Looking first at migrating my fleet to netdata + kinesis, for anyone else who comes across this.

khanhntd commented 1 year ago

Hi @edlins, thanks for contacting us. We are looking at the problem now!

github-actions[bot] commented 12 months ago

This issue was marked stale due to lack of activity.

jefchien commented 11 months ago

We've fixed this as part of https://github.com/aws/amazon-cloudwatch-agent/pull/847, which has been released as of v1.300028.0. The agent will now drop unsupported values like NaN and +/- Inf.