digitalocean / do-agent

Collects system metrics from DigitalOcean Droplets
Apache License 2.0

do-agent process constant high CPU usage #233

Closed. lots0logs closed this issue 3 years ago.

lots0logs commented 3 years ago

Describe the problem

The agent process uses 250-300% CPU the entire time it's running. That can't be normal.

Steps to reproduce

Run do-agent on a droplet.

Expected behavior

The agent should not constantly consume 250-300% CPU.

System Information

Ubuntu 20.04.1

do-agent information:


/opt/digitalocean/bin/do-agent --version:

do-agent (DigitalOcean Agent)

Version:     3.7.1
Revision:    32704ad
Build Date:  Mon Oct  5 16:27:32 UTC 2020
Go Version:  go1.15.2
Website:     https://github.com/digitalocean/do-agent

Copyright (c) 2020 DigitalOcean, Inc. All rights reserved.

This work is licensed under the terms of the Apache 2.0 license.
For a copy, see <https://www.apache.org/licenses/LICENSE-2.0.html>.

Ubuntu, Debian

apt-cache policy do-agent:

do-agent:
Installed: 3.7.1
Candidate: 3.7.1
Version table:
*** 3.7.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
100 /var/lib/dpkg/status
3.6.0 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.5.6 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.5.5 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.5.4 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.5.2 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.5.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.3.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.2.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
3.0.5 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.2.4 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.2.3 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.2.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.2.0 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.1.3 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.0.2 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.0.1 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
2.0.0 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages
1.1.3 500
500 https://repos.insights.digitalocean.com/apt/do-agent main/main amd64 Packages

The systemd journal is being spammed constantly with the following:

-- Logs begin at Tue 2020-10-13 22:28:25 UTC, end at Wed 2020-10-14 18:24:57 UTC. --
Oct 14 18:24:57 k8s-cluster-stage--worker-3 /opt/digitalocean/bin/do-agent[901]: /home/do-agent/cmd/do-agent/run.go:60: failed to gather metrics: 2 error(s) occurred:
* collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
* collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values
[the same two-error block repeats continuously in the journal, multiple times within the same second]
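(In Prometheus-based collectors, a "was collected before with the same name and label values" error typically means the same mountpoint shows up more than once in the mount table the agent reads. A quick, hypothetical sanity check on an affected droplet:)

```
# Does /boot/efi appear more than once in the mount table?
grep /boot/efi /proc/mounts
findmnt /boot/efi
```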
bsnyder788 commented 3 years ago

@lots0logs I can't reproduce this by just spinning up a 20.04 droplet with the agent. It seems to be working just fine on the ten 20.04 droplets I just spun up (and there are no spammy logs hitting the journal). We did have a similar report (https://github.com/digitalocean/do-agent/issues/228), but in that case the user was using DOKS and it was with an alpha version of kube-state-metrics. I notice your logs also say "k8s-cluster-stage..". Do you have kube-state-metrics installed? If so, what version of kube-state-metrics?
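(For anyone else checking: one way to look up the running kube-state-metrics version, assuming it is deployed as a Deployment; the namespace varies by install.)

```
kubectl get deployment kube-state-metrics -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```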

lots0logs commented 3 years ago

@bsnyder788 Yeah, it looks like I have a rancher/coreos-kube-state-metrics:v1.9.5 container running on one of my nodes, though the issue I described happens on all nodes.

bsnyder788 commented 3 years ago

Thanks for the extra info, @lots0logs. I'll see if I can reproduce it on a k8s cluster and get to the bottom of why these errors are popping up for you.

bsnyder788 commented 3 years ago

@lots0logs I was not able to reproduce it on a k8s cluster either. Can you try adding --web.listen in the systemd unit file (in the ExecStart line)? e.g. ExecStart=/opt/digitalocean/bin/do-agent --web.listen --syslog. After doing a systemctl daemon-reload and a systemctl restart do-agent, you should be able to do a curl localhost:9100 and get the raw metrics that are being scraped. I would be curious to see if that somehow has duplicate entries for the metrics in your original log.
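(A sketch of those steps; the unit file location is an assumption and may vary by install.)

```
# 1. Edit the do-agent unit (e.g. /etc/systemd/system/do-agent.service)
#    so the ExecStart line reads:
#    ExecStart=/opt/digitalocean/bin/do-agent --web.listen --syslog
sudo systemctl daemon-reload
sudo systemctl restart do-agent
curl localhost:9100   # dump the raw metrics the agent is scraping
```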

UnKnoWn-Consortium commented 3 years ago

I have encountered the exact same issue (even the systemd message is the same) with do-agent on two droplets running Ubuntu 20.04.1. One was set up yesterday, while the other has been in use for just a month or so. I had to stop the do-agent service.

P.S. I am not running Kubernetes on the two affected droplets.

UnKnoWn-Consortium commented 3 years ago

> Can you try adding --web.listen in the systemd unit file (in the ExecStart line)? [...] I would be curious to see if that somehow has duplicate entries for the metrics in your original log.

Following your guide, I am able to get the following:


2 error(s) occurred:
* collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
* collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values
plutocrat commented 3 years ago

Confirmed here. Running 4 Ubuntu 20.04 droplets. On all 4, do-agent CPU is running at around 95% all the time. Tried the --web.listen instruction, but apparently nothing is listening on that port when I do. I can also confirm the error messages in /var/log/syslog and journalctl -xe. Installed version: 3.7.1. For now I've just got rid of it with apt purge do-agent.
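(If nothing seems to be listening after that change, a couple of standard checks, nothing do-agent specific:)

```
sudo ss -tlnp | grep 9100   # is anything bound to the metrics port?
systemctl cat do-agent      # did --web.listen actually land in ExecStart?
```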

bsnyder788 commented 3 years ago

> Following your guide, I am able to get the following:
>
> 2 error(s) occurred:
> * collected metric "node_filesystem_size_bytes" { ... } was collected before with the same name and label values
> * collected metric "node_filesystem_free_bytes" { ... } was collected before with the same name and label values

When you do the curl localhost:9100, what is the raw output?

bsnyder788 commented 3 years ago

Is your 20.04 image the stock DO image or a custom 20.04 image?


UnKnoWn-Consortium commented 3 years ago

> When you do the curl localhost:9100, what is the raw output?

The quoted part was exactly what I got when I did curl localhost:9100.

UnKnoWn-Consortium commented 3 years ago

> Is your 20.04 image the stock DO image or a custom 20.04 image?

Both affected droplets were built from the stock DO Ubuntu 20.04 LTS image with the "Monitoring" option checked in the DO dashboard (https://cloud.digitalocean.com/droplets/new).

bsnyder788 commented 3 years ago

> The quoted part was exactly what I got when I did curl localhost:9100.

Ok, thanks. I wanted to make sure that was all the info we could discern.

bsnyder788 commented 3 years ago

> Both affected droplets were built from the stock DO Ubuntu 20.04 LTS image with the "Monitoring" option checked.

Thanks. I will try to reproduce from that.

bsnyder788 commented 3 years ago

I couldn't reproduce on a myriad of 20.04 droplets either, but I went ahead and made a new beta release that disables collection of /boot mountpoints. If some of you would give it a try to see whether it fixes things on your specific droplets, that would be fantastic. You can install it via curl -SsL https://repos.insights.digitalocean.com/install.sh | sudo BETA=1 bash. Please let me know if that fixes your issues. cc @UnKnoWn-Consortium @lots0logs @plutocrat
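(For reference, installing the beta and then confirming the new version with the same command shown earlier in this thread:)

```
curl -SsL https://repos.insights.digitalocean.com/install.sh | sudo BETA=1 bash
/opt/digitalocean/bin/do-agent --version   # should now report the 3.8.0 pre-release
```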

UnKnoWn-Consortium commented 3 years ago

@bsnyder788 The 3.8.0 pre-release you have just made seems to have fixed the issue. At least it is no longer spamming those two error messages and taking a whole lot of CPU resources.

bsnyder788 commented 3 years ago

> @bsnyder788 The 3.8.0 pre-release you have just made seems to have fixed the issue.

Thank you so much for testing it out, @UnKnoWn-Consortium; it's great that it's helping. I will leave this release in beta over the weekend and check in early next week to make sure no other regressions or issues have shown up by then, and if all is good, I will promote 3.8.0 to stable.

UnKnoWn-Consortium commented 3 years ago

> I will leave this release in beta over the weekend and check in early next week [...] and if all is good, I will promote 3.8.0 to stable.

Okay, I will keep an eye out and see if anything goes astray with it (I seriously hope not). Have a nice weekend, btw.

plutocrat commented 3 years ago

Also confirming DO stock Ubuntu 20.04 build. I have installed the beta release on one of the four affected boxes, and it's showing healthy, near-zero CPU. Thanks. Will monitor.

plutocrat commented 3 years ago

24 hours later, and it's still OK. Note: if you've been affected by this issue, you might want to clean out your systemd journal logs. I just got rid of 3.5 GB of spam from mine using "/bin/journalctl --vacuum-size=500M". Your mileage may vary: there may be more targeted ways to remove just the do-agent logs, although I haven't found them.
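(The cleanup amounts to the following; both are standard journalctl flags.)

```
journalctl --disk-usage              # how much space the journal is using
sudo journalctl --vacuum-size=500M   # drop oldest entries until under ~500M
```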

lots0logs commented 3 years ago

Sorry for the delayed response. We had a hurricane here and I was without power for a few days. I'm glad to see that y'all were able to identify the problem and implement a fix! Thanks!!

bsnyder788 commented 3 years ago

Thanks all! I'm going to go ahead and release 3.8.0 on the stable branch as well.

bsnyder788 commented 3 years ago

3.8.0 is officially released. I am going to close this. Please open a new issue if you see anything similar in the future. Thanks!