awslabs / collectd-cloudwatch

A collectd plugin for sending data to Amazon CloudWatch
MIT License

Restarting EC2 Instance Stops Collectd From Sending to CloudWatch? #25

Open scott-wood-vgh opened 7 years ago

scott-wood-vgh commented 7 years ago

Hi,

I have a somewhat urgent and baffling problem: my collectd and CloudWatch plugin installation pushes metrics fine until the box is restarted. After that, its CloudWatch alarms go into Insufficient Data because of the following error in the collectd logs:

[2016-11-30 19:43:24] [AmazonCloudWatchPlugin][cloudwatch.modules.client.putclient] Could not put metric data using the following endpoint: 'https://monitoring.us-east-1.amazonaws.com/'. [Exception: HTTPSConnectionPool(host='monitoring.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /?Action=PutMetricData&MetricData.member.1.Dimensions.member.1.Name=Host&MetricData.member.1.Dimensions.member.1.Value=i-ffd27e66&MetricData.member.1.Dimensions.member.2.Name=PluginInstance&MetricData.member.1.Dimensions.member.2.Value=NONE&MetricData.member.1.MetricName=collectd.gauge.Tomcat&MetricData.member.1.StatisticValues.Maximum=1.0&MetricData.member.1.StatisticValues.Minimum=1.0&MetricData.member.1.StatisticValues.SampleCount=6&MetricData.member.1.StatisticValues.Sum=6.0&MetricData.member.1.Timestamp=20161130T194224Z&MetricData.member.2.Dimensions.member.1.Name=Host&MetricData.member.2.Dimensions.member.1.Value=i-ffd27e66&MetricData.member.2.Dimensions.member.2.Name=PluginInstance&MetricData.member.2.Dimensions.member.2.Val
[2016-11-30 19:43:24] [AmazonCloudWatchPlugin][cloudwatch.modules.client.putclient] Request details: 'Action=PutMetricData&MetricData.member.1.Dimensions.member.1.Name=Host&MetricData.member.1.Dimensions.member.1.Value=i-ffd27e66&MetricData.member.1.Dimensions.member.2.Name=PluginInstance&MetricData.member.1.Dimensions.member.2.Value=NONE&MetricData.member.1.MetricName=collectd.gauge.Tomcat&MetricData.member.1.StatisticValues.Maximum=1.0&MetricData.member.1.StatisticValues.Minimum=1.0&MetricData.member.1.StatisticValues.SampleCount=6&MetricData.member.1.StatisticValues.Sum=6.0&MetricData.member.1.Timestamp=20161130T194224Z&MetricData.member.2.Dimensions.member.1.Name=Host&MetricData.member.2.Dimensions.member.1.Value=i-ffd27e66&MetricData.member.2.Dimensions.member.2.Name=PluginInstance&MetricData.member.2.Dimensions.member.2.Value=NONE&MetricData.member.2.MetricName=load.load&MetricData.member.2.StatisticValues.Maximum=0.05&MetricData.member.2.StatisticValues.Minimum=0.01&MetricData.member.2.StatisticValues.SampleCount=18&Metric
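The logged request is one long query string, which makes it hard to see which metrics were in the batch. As a rough aid for reading it, here is a minimal Python 3 sketch that groups the parameters by their MetricData.member.N index. The function name decode_put_metric_data is hypothetical and not part of the plugin, and the sample string is a shortened excerpt of the request above rather than the full log line.

```python
from collections import defaultdict
from urllib.parse import parse_qsl


def decode_put_metric_data(query):
    """Group PutMetricData query parameters by their MetricData.member.N index.

    Hypothetical helper for reading log output; not part of the plugin.
    """
    metrics = defaultdict(dict)
    for key, value in parse_qsl(query, keep_blank_values=True):
        parts = key.split(".")
        # Only keys shaped like MetricData.member.<N>.<field...> describe a metric.
        # Truncated trailing fragments (such as a bare "Metric") are skipped.
        if len(parts) >= 4 and parts[:2] == ["MetricData", "member"] and parts[2].isdigit():
            metrics[int(parts[2])][".".join(parts[3:])] = value
    return dict(metrics)


# Shortened excerpt of the request details logged above.
sample = (
    "Action=PutMetricData"
    "&MetricData.member.1.MetricName=collectd.gauge.Tomcat"
    "&MetricData.member.1.StatisticValues.Maximum=1.0"
    "&MetricData.member.1.StatisticValues.SampleCount=6"
    "&MetricData.member.2.MetricName=load.load"
)

decoded = decode_put_metric_data(sample)
for index in sorted(decoded):
    print(index, decoded[index]["MetricName"])
```

This only helps confirm what the plugin tried to send; it says nothing about why the connection to the endpoint failed.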

What could be causing this issue to surface only after reboots? As far as I can tell, all boxes have identical networking, and it can affect any EC2 instance after I reboot it from the console. Is there perhaps some CloudWatch plugin service that does not get started on boot? The collectd service is running fine and I've tried restarting it multiple times.

Thank you.

scott-wood-vgh commented 7 years ago

I've just realized that even restarting the collectd service triggers it. I tried to enable debug logging and managed to stop another box from pushing metrics just by running sudo service collectd restart. The only extra information is that the plugin seems to be trying to push what I expect:

[2016-11-24 13:56:13] [AmazonCloudWatchPlugin][cloudwatch.modules.flusher] [debug] flushing metrics collectd--gauge-Tomcat[5] load--load-[18] df-root-percent_bytes-used[6]

EDIT: It looks like there was a segfault in the syslog right before some instances stopped working. FWIW, I'm on collectd.5-4.0

[7615552.971016] collectd[6891]: segfault at 98 ip 00007f43afd8c721 sp 00007ffe83a72ba0 error 4 in libpython2.7.so.1.0[7f43afc7a000+2dc000]
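To get more out of that kernel line, here is a minimal Python 3 sketch that parses it and computes how far the faulting instruction pointer is from the start of the libpython mapping. The function name parse_segfault is hypothetical. The error-code decoding assumes the standard x86 page-fault bits (bit 0: page present, bit 1: write, bit 2: user mode). The offset is only usable with tools such as addr2line if the reported mapping starts at file offset 0, which is an assumption here.

```python
import re

# Matches the kernel's "segfault at ... ip ... sp ... error ... in lib[base+size]" line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+) in (?P<module>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the module, fault address, and instruction offset from a segfault line.

    Hypothetical helper; returns None if the line does not match.
    """
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    error = int(match.group("error"), 16)
    return {
        "module": match.group("module"),
        "fault_address": int(match.group("addr"), 16),
        # Distance of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(error & 1),
        "write": bool(error & 2),
        "user_mode": bool(error & 4),
    }


line = (
    "[7615552.971016] collectd[6891]: segfault at 98 ip 00007f43afd8c721 "
    "sp 00007ffe83a72ba0 error 4 in libpython2.7.so.1.0[7f43afc7a000+2dc000]"
)
info = parse_segfault(line)
print(info["module"], hex(info["offset"]), info)
```

Read this way, the line describes a user-mode read of an unmapped address close to zero (0x98) inside libpython2.7, which is the pattern a NULL-pointer dereference would typically produce. That is an interpretation of the log line, not a confirmed diagnosis.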