Stackdriver / collectd

Stackdriver's monitoring agent based on collectd (http://collectd.org).
https://cloud.google.com/monitoring/agent/

Recover from transient GCE metadata server failures #139

Closed ghost closed 5 years ago

ghost commented 5 years ago

I'm experiencing an issue with the Stackdriver monitoring agent for VMs running in Google Cloud. Everything worked as expected until yesterday (Dec 1st).

Metrics are not sent to the Stackdriver Monitor console in Google Cloud. The error is reported below. Any ideas on what could be wrong? No changes were made.

Expected behavior

Metrics are published to Stackdriver

Actual behavior

Metrics are not being sent to Stackdriver. Error in logs:

Dec 02 20:08:46 test collectd[29217]: write_gcm: Asking metadata server for auth token
Dec 02 20:08:46 test collectd[29217]: write_gcm: Error or buffer overflow when building auth_header
Dec 02 20:08:46 test collectd[29217]: write_gcm: wg_oauth2_get_auth_header failed.
Dec 02 20:08:46 test collectd[29217]: write_gcm: wg_transmit_unique_segment failed.
Dec 02 20:08:46 test collectd[29217]: write_gcm: wg_transmit_unique_segments failed. Flushing.

Steps to reproduce

Nothing specific, collectd has been up and running for months

brodul commented 5 years ago

Same issue here. It happens about once a week and persists for 30 minutes. It started happening 14 days ago.

Version of /opt/stackdriver/collectd/sbin/stackdriver-collectd: collectd 5.5.2.git

Operating system / distro: Ubuntu 16.04 LTS

Logs:

Jul 01 17:27:45 test collectd[12232]: write_gcm: Asking metadata server for auth token
Jul 01 17:27:45 test collectd[12232]: write_gcm: Error or buffer overflow when building auth_header
Jul 01 17:27:45 test collectd[12232]: write_gcm: wg_oauth2_get_auth_header failed.
Jul 01 17:27:45 test collectd[12232]: write_gcm: wg_transmit_unique_segment failed.
Jul 01 17:27:45 test collectd[12232]: write_gcm: wg_transmit_unique_segments failed. Flushing.

Let me know if I can provide additional information.

igorpeshansky commented 5 years ago

This seems to be transient unavailability of the GCE metadata server. We will look into retrying the request (using this issue to track). In the meantime, a workaround would be to use private key authentication, which uses a different path to obtain credentials.
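
As a rough illustration of the retry idea mentioned above (not the agent's actual fix, which lives in the C code of write_gcm), the sketch below retries a token request against the GCE metadata server with exponential backoff. The helper names `retry_with_backoff` and `fetch_metadata_token` are hypothetical, and the attempt count and delays are arbitrary assumptions.

```shell
# Hypothetical helper: run a command up to $1 times, doubling the delay
# (starting at $2 seconds) after each failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up after the last allowed attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Ask the GCE metadata server for an access token for the VM's default
# service account. --fail makes curl exit non-zero on HTTP errors so
# the retry loop can detect them.
fetch_metadata_token() {
  curl --silent --fail --max-time 5 \
    -H 'Metadata-Flavor: Google' \
    'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token'
}

# Usage (from a GCE VM): retry_with_backoff 5 1 fetch_metadata_token
```

Run from a VM, this can also help confirm whether the metadata server is reachable at the moment the agent logs the auth_header error.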

brodul commented 5 years ago

Happened again. Thank you for the workaround; I will try it out.

jkohen commented 5 years ago

Sorry for the delay, this was fixed by https://github.com/Stackdriver/collectd/pull/140. Please make sure you are running version 5.5.2-383 or higher. If it happens regularly with a recent version of the agent, please reopen this issue.
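
One way to check an installed version against the 5.5.2-383 minimum mentioned above is a version-aware comparison. This is a minimal sketch that assumes GNU `sort -V` is available; `version_at_least` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2,
# using GNU sort's version ordering (sort -V).
version_at_least() {
  local have="$1" min="$2"
  [ "$(printf '%s\n' "$min" "$have" | sort -V | head -n 1)" = "$min" ]
}

# Example on an RPM-based system (on Debian/Ubuntu, dpkg-query -W
# could supply the version instead):
#   have=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' stackdriver-agent)
#   version_at_least "$have" 5.5.2-383 && echo "at or above 5.5.2-383"
```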

sffc commented 2 years ago

I am encountering this issue in my GCP project. I had not seen it at any point in the last 2 years, and now I've gone 30 hours without any metrics being successfully ingested.

Example logs from the collectd service:

[sffc@<redacted> ~]$ sudo systemctl status stackdriver-agent
● stackdriver-agent.service - LSB: start and stop Stackdriver Agent
   Loaded: loaded (/etc/rc.d/init.d/stackdriver-agent; generated)
   Active: active (running) since Mon 2022-01-24 03:56:41 UTC; 32min ago
     Docs: man:systemd-sysv-generator(8)
 Main PID: 1261 (stackdriver-col)
    Tasks: 14 (limit: 39792)
   Memory: 5.9M
   CGroup: /system.slice/stackdriver-agent.service
           └─1261 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /etc/stackdriver/collectd.conf -P /var/run/stackdriver-agent.pid

Jan 24 04:27:41 <redacted> collectd[1261]: write_gcm: Asking metadata server for auth token
Jan 24 04:27:41 <redacted> collectd[1261]: write_gcm: Error or buffer overflow when building auth_header
Jan 24 04:27:41 <redacted> collectd[1261]: write_gcm: wg_oauth2_get_auth_header failed.
Jan 24 04:27:41 <redacted> collectd[1261]: write_gcm: wg_transmit_unique_segment failed.
Jan 24 04:27:41 <redacted> collectd[1261]: write_gcm: wg_transmit_unique_segments failed. Flushing.
Jan 24 04:28:41 <redacted> collectd[1261]: write_gcm: Asking metadata server for auth token
Jan 24 04:28:41 <redacted> collectd[1261]: write_gcm: Error or buffer overflow when building auth_header
Jan 24 04:28:41 <redacted> collectd[1261]: write_gcm: wg_oauth2_get_auth_header failed.
Jan 24 04:28:41 <redacted> collectd[1261]: write_gcm: wg_transmit_unique_segment failed.
Jan 24 04:28:41 <redacted> collectd[1261]: write_gcm: wg_transmit_unique_segments failed. Flushing.

Version information:

[sffc@<redacted> ~]$ sudo yum info stackdriver-agent
Last metadata expiration check: 4:26:58 ago on Mon 24 Jan 2022 12:08:53 AM UTC.
Installed Packages
Name         : stackdriver-agent
Version      : 5.5.2
Release      : 1002.el8
Architecture : x86_64
Size         : 5.6 M
Source       : stackdriver-agent-5.5.2-1002.el8.src.rpm
Repository   : @System
From repo    : google-cloud-monitoring
Summary      : Stackdriver system metrics collection daemon
URL          : http://www.stackdriver.com/
License      : GPLv2
Description  : The Stackdriver system metrics daemon collects system statistics and
             : sends them to the Stackdriver service.
             :
             : Currently includes collectd.

sffc commented 2 years ago

It appears that my yum repo config file was out-of-date and not picking up version 6 of stackdriver-agent. I've upgraded to version 6.1.4 and it appears to be working again.

mikehardenize commented 2 years ago

Same issue here. I had to change my yum repo base url from:

https://packages.cloud.google.com/yum/repos/google-cloud-monitoring-el7-x86_64

To:

https://packages.cloud.google.com/yum/repos/google-cloud-monitoring-el7-x86_64-6

And then do a yum update stackdriver-agent
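
The base URL change described above could be scripted roughly as follows. This is only a sketch: `use_v6_monitoring_repo` is a hypothetical helper, and the repo file path in the usage comment is an assumption, so check which file under /etc/yum.repos.d/ actually holds the google-cloud-monitoring entry on your system.

```shell
# Hypothetical helper: point a yum repo file at the "-6" repository by
# appending "-6" to a google-cloud-monitoring baseurl that lacks it.
# The original file is kept alongside with a .bak suffix.
use_v6_monitoring_repo() {
  local repo_file="$1"
  sed -i.bak -E \
    's#^(baseurl=.*/google-cloud-monitoring-el[0-9]+-x86_64)/?$#\1-6#' \
    "$repo_file"
}

# Usage, as root (the file path is an assumption):
#   use_v6_monitoring_repo /etc/yum.repos.d/google-cloud-monitoring.repo
#   yum update stackdriver-agent
```

The pattern only matches a baseurl that ends in x86_64, so running it a second time leaves an already updated file unchanged.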

e-compagno commented 2 years ago

I solved this just by updating the monitoring agent with

sudo bash add-monitoring-agent-repo.sh --also-install

then restarting it with sudo service stackdriver-agent restart.

Detailed instructions can be found here.
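
After upgrading and restarting, one way to confirm the problem is gone is to count how often the auth failure still shows up in recent logs. A minimal sketch, assuming the agent logs to the systemd journal under the stackdriver-agent unit as in the status output earlier in this thread; `count_auth_failures` is a hypothetical helper.

```shell
# Hypothetical helper: count lines on stdin that report the
# auth-header failure seen in this thread. grep -c exits non-zero
# when the count is 0, so "|| true" keeps the function's status clean.
count_auth_failures() {
  grep -c 'write_gcm: wg_oauth2_get_auth_header failed' || true
}

# Usage:
#   journalctl -u stackdriver-agent --since '10 minutes ago' | count_auth_failures
```

A count of 0 a few minutes after the restart suggests the agent is obtaining auth tokens again.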