Closed lots0logs closed 3 years ago
@lots0logs I can't reproduce this by just spinning up a 20.04 droplet with the agent. It seems to be working just fine on the 10, 20.04 droplets I just spun up (and there are no spammy logs hitting the journal). We did have a similar report (https://github.com/digitalocean/do-agent/issues/228), but in that case the user was using DOKS and it was with an alpha version of kube-state-metrics. I notice your logs also say "k8s-cluster-stage..", do you have kube-state-metrics installed? If so, what version of kube-state-metrics?
@bsnyder788 Yeah, looks like I have a rancher/coreos-kube-state-metrics:v1.9.5 container running on one of my nodes. Though the issue I described happens on all nodes.
Thanks for the extra info @lots0logs . I'll see if I can reproduce it on a k8s cluster and get to the bottom of why these errors are popping up for you.
@lots0logs I was not able to reproduce it on a k8s cluster either. Can you try adding --web.listen in the systemd unit file (in the ExecStart line), e.g. ExecStart=/opt/digitalocean/bin/do-agent --web.listen --syslog. After doing a systemctl daemon-reload and a systemctl restart do-agent, you should be able to do a curl localhost:9100 and get the raw metrics that are being scraped. I would be curious to see whether that somehow has duplicate entries for the metrics in your original log.
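The steps above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name add_web_listen is made up for illustration, the unit file location is not given in this thread (systemctl cat do-agent should show the real path), and the flag is appended at the end of the ExecStart line on the assumption that flag order does not matter.

```shell
# Append --web.listen to the ExecStart line of a systemd unit file.
# Does nothing if the flag is already present, so it is safe to re-run.
add_web_listen() {
    unit_file="$1"
    if grep -q -e '^ExecStart=.*--web\.listen' "$unit_file"; then
        return 0
    fi
    # Write to a temporary copy first, then replace the original.
    sed 's|^ExecStart=.*|& --web.listen|' "$unit_file" > "$unit_file.tmp" &&
        mv "$unit_file.tmp" "$unit_file"
}

# On an affected droplet (as root; the unit path below is an assumption):
#   add_web_listen /etc/systemd/system/do-agent.service
#   systemctl daemon-reload
#   systemctl restart do-agent
#   curl localhost:9100
```

Trying the helper on a copy of the unit file first makes it easy to inspect the resulting ExecStart line before touching the real one.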
I have encountered exactly the same issue (even the systemd message is the same) with do-agent on two droplets running Ubuntu 20.04.1. One was set up yesterday while the other one has been in use for just a month or so. I have to stop the do-agent service.
P.S. I am not running Kubernetes on the two affected droplets.
Following your guide, I am able to get the following:
2 error(s) occurred:
* collected metric "node_filesystem_size_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:1.09422592e+08 > } was collected before with the same name and label values
* collected metric "node_filesystem_free_bytes" { label:<name:"device" value:"/dev/vda15" > label:<name:"fstype" value:"vfat" > label:<name:"mountpoint" value:"/boot/efi" > gauge:<value:9.9854336e+07 > } was collected before with the same name and label values
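To see at a glance which metric and mountpoint are being reported twice, error output like the above can be reduced to one "metric mountpoint" pair per line. This is a rough sketch, not part of do-agent: the function name duplicate_metrics is hypothetical, and it assumes one error per line in the format shown above.

```shell
# Read "collected metric ... was collected before" errors on stdin and
# print "<metric name> <mountpoint>" for each line that matches.
duplicate_metrics() {
    sed -n 's/.*collected metric "\([^"]*\)".*name:"mountpoint" value:"\([^"]*\)".*/\1 \2/p'
}

# Possible usage on a droplet with --web.listen enabled:
#   curl -s localhost:9100 | duplicate_metrics
```

For the two errors quoted above, both lines point at the /boot/efi mountpoint.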
Confirmed here. Running 4 Ubuntu 20.04 droplets. On all 4, do-agent CPU is running at around 95% all the time. Tried the --web.listen instruction, but apparently nothing is listening on that port when I do. I can also confirm the error messages in /var/log/syslog and journalctl -xe. Installed version: 3.7.1. For now I've just got rid of it with apt purge do-agent.
When you do the curl localhost:9100, what is the raw output?
Is your 20.04 image the stock DO image or a custom 20.04 image?
The quoted part was exactly what I got when I did curl localhost:9100.
Both affected droplets were built from the stock DO Ubuntu 20.04 LTS image with the "Monitoring" option checked in the DO dashboard (https://cloud.digitalocean.com/droplets/new).
Ok, thanks. I wanted to make sure that was all the info we could discern.
Thanks. I will try to reproduce from that.
I couldn't reproduce on a myriad of 20.04 droplets either, but I went ahead and made a new beta release that disables the collection of /boot mountpoints. If some of you would give it a try to see if it now works on your specific droplets, that would be fantastic. You can install it via curl -SsL https://repos.insights.digitalocean.com/install.sh | sudo BETA=1 bash. Please let me know if that fixes your issues. cc @UnKnoWn-Consortium @lots0logs @plutocrat
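After installing, it may help to confirm that the running agent is at least the release carrying the fix (3.8.0, per the replies below). The following is a minimal sketch: version_ge is a hypothetical helper, it relies on sort -V (GNU coreutils), and the exact output format of do-agent --version is not shown in this thread, so the version string is passed in by hand.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Uses version-aware sorting: if $2 sorts first (or equal), $1 >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: 3.7.1 was the affected version reported in this thread.
if version_ge "3.7.1" "3.8.0"; then
    echo "3.7.1 already includes the /boot change"
else
    echo "3.7.1 predates the /boot change"
fi
```

The installed version itself can be read from /opt/digitalocean/bin/do-agent --version or apt-cache policy do-agent.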
@bsnyder788 The 3.8.0 pre-release you have just made seems to have fixed the issue. At least it is no longer spamming those two error messages and taking a whole lot of CPU resources.
Thank you so much for testing it out @UnKnoWn-Consortium, that is great that it is helping out. I will leave this release in beta over the weekend and check in early next week to make sure no other regressions or issues have shown themselves to you by then and if all is good, I will promote 3.8.0 to stable.
Okay I will keep an eye out and see if anything goes astray with it (I seriously hope not). Have a nice weekend btw.
Also confirming DO stock Ubuntu 20.04 build. Have installed the beta release on one of the four affected boxes, and it's showing healthy, near-zero CPU. Thanks. Will monitor.
24 hours later, and it's still OK. Note: if you've been affected by this issue you might want to clean out your systemd journal logs. Just got rid of 3.5 GB of spam from mine using "/bin/journalctl --vacuum-size=500M". Your mileage may vary: there may be more subtle ways to remove the logs from just do-agent, although I haven't found them.
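Before vacuuming, it can be useful to gauge how much of the journal is this particular error. The sketch below only counts matching lines and does not delete anything; count_duplicate_errors is a hypothetical helper name, and the match string is taken from the error text quoted earlier in this thread.

```shell
# Count journal lines (read from stdin) that contain the duplicate-metric error.
count_duplicate_errors() {
    grep -c 'was collected before with the same name and label values'
}

# Possible usage on an affected droplet:
#   journalctl -u do-agent --no-pager | count_duplicate_errors
#   journalctl --disk-usage            # total space used by the journal
#   journalctl --vacuum-size=500M      # trim the whole journal, as above
```

Note that --vacuum-size trims archived journal files as a whole, not just the do-agent entries.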
Sorry for the delayed response. We had a hurricane here and I was without power for a few days. I'm glad to see that y'all were able to identify the problem and implement a fix! Thanks!!
Thanks all! I'm going to go ahead and release 3.8.0 on the stable branch as well.
3.8.0 is officially released. I am going to close this. Please open a new issue if you see anything similar in the future. Thanks!
Describe the problem
The agent process uses 250-300% CPU the entire time it's running. That can't be normal.
Steps to reproduce
Run do-agent on a droplet.
Expected behavior
Does not constantly eat 250-300% CPU.
System Information
Ubuntu 20.04.1
do-agent information:
/opt/digitalocean/bin/do-agent --version:
apt-cache policy do-agent (Ubuntu, Debian):
The systemd journal is being spammed constantly with the following: