Closed gregolsky closed 11 months ago
@gregolsky can you clarify whether the impacted VM is running 2.10.0.3 rather than 2.10.0.2? You mentioned this:
Distro and Version: Ubuntu 22.04 WALinuxAgent version 2.10.0.2
but the logs indicate it is 2.10.0.3.
Can you also send the following logs to nnandigam@microsoft.com: /var/log/waagent.log and /var/log/syslog?
Alternatively, you can run the log collector, which gathers the agent logs. The command below only works if the 2.8.0.11 egg is still on the VM:
root# python3 -u /var/lib/waagent/WALinuxAgent-2.8.0.11/bin/WALinuxAgent-2.8.0.11-py2.7.egg -collect-logs
Running log collector mode normal
Log collection successfully completed. Archive can be found at /var/lib/waagent/logcollector/logs.zip and detailed log output can be found at /var/lib/waagent/logcollector/results.txt
Sorry, I fixed the version - it's 2.10.0.3 for sure. I'll collect the info for you and send it tomorrow.
We have multiple VMs affected this way. On each one, the CPU increase correlates with the walinuxagent update.
Is there a workaround for the time being? Can we downgrade the version to 2.9?
@nagworld9 I sent you the files.
Sent: Monday, October 9, 2023. Subject: Re: [Azure/WALinuxAgent] [BUG] CPU credits drain on B1ls after update to 2.10.0.3 (Issue #2940)
@gregolsky From the log, I don't see the agent doing anything different in 2.10.0.3 compared to 2.9.1.1. By the way, how do you compute the agent's CPU usage, and how was the graph above produced? Is there any chance you are combining the last computed 2.9.1.1 value with the new 2.10.0.3 value after the update?
Do you look at the CPU usage reported by the agent's cgroups? If so, what does the output of the command below look like? Can you also run pstree? I want to check that there is no unknown process in the cgroup.
systemd-cgls --unit system.slice --all
pstree -p
As for downgrading, we don't have an option for that today.
after update we see CPU usage increase by 2%, which then causes CPU credits drain on small burstable instances e.g. B1ls where 5% is the baseline
In terms of percentage, that's not a significant bump. When you say 5% is the baseline, how is this value calculated, and who defines it?
Azure defines this. B1ls is the smallest instance type in the B-series.
For that instance, 2% is a significant bump.
I correlated the time of the average CPU usage bump with the graph in Azure. It matches on multiple instances.
https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-b-series-burstable
This bug renders B1ls unusable: when usage stays over 5% for some time, the VM consumes all of its CPU credits. These instances need to stay below 5% most of the time to be usable.
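To put rough numbers on the drain described above, here is a minimal sketch of the credit arithmetic. It assumes the rule from the B-series page linked above, as I understand it: credits banked or spent per minute equal (baseline % minus CPU usage %) divided by 100. The function names and the bank size in the example are illustrative assumptions, not values taken from this thread.

```python
def credits_per_minute(baseline_pct: float, usage_pct: float) -> float:
    """Net credits banked (positive) or spent (negative) per minute.

    Assumes the B-series rule: (baseline % - CPU usage %) / 100.
    """
    return (baseline_pct - usage_pct) / 100.0


def hours_until_empty(banked_credits: float, baseline_pct: float, usage_pct: float) -> float:
    """Hours until the credit bank is drained.

    Returns infinity when usage is at or below the baseline, because the
    bank does not shrink in that case.
    """
    rate = credits_per_minute(baseline_pct, usage_pct)
    if rate >= 0:
        return float("inf")
    return banked_credits / (-rate * 60.0)


if __name__ == "__main__":
    # 5% baseline comes from this thread; the 72-credit bank is an assumed figure.
    for usage in (3.0, 5.0, 7.0):
        print(usage, credits_per_minute(5.0, usage) * 60.0, hours_until_empty(72.0, 5.0, usage))
```

With a 5% baseline, an average usage of 7% spends about 1.2 credits per hour, so an assumed bank of 72 credits would last roughly 60 hours. After that the VM is held to its baseline, which matches the unresponsiveness reported here.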
@gregolsky Do you happen to have other distro types and VM sizes that also show an increase in CPU usage after the update? Or is it only on these instances?
I have other instances (all Ubuntu 22.04 though), but here it was very easy to notice, because it lost all CPU credits and stopped responding in a timely fashion.
We're using latest Ubuntu 22.04 image on Azure.
What were the changes between 2.9 and 2.10 that may have had such an impact on CPU usage?
We did rewrite the code around agent update, but that code path is not enabled in 2.10.0.3. The logs indicate that the agent did what was expected.
Is the CPU usage percentage you have shown based on one CPU core as 100%? Is the agent taking more than 5% out of 100% on a 1-core CPU?
I'm not sure how much CPU the agent's process alone took before, but our whole workload in an idle state (all processes on the VM together) was around 3-5% in total, which is below the baseline. After the walinuxagent update, it's above the baseline.
B1ls has only 1 vCPU, and once it's out of CPU credits, its baseline is all you can use, so it then shows 100% usage at all times. However, I'll try to find a VM that still has some credits to check what the CPU usage patterns look like. Do you have any preference for how I should collect these metrics or perform further analysis?
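One possible way to collect per-process numbers of the kind asked about here is to sample `/proc/<pid>/stat` twice and compare the accumulated CPU time. The sketch below is only an illustration of that approach, not a method anyone in this thread used. The function names are hypothetical, and it assumes a Linux `/proc` filesystem where fields 14 and 15 of the stat line are `utime` and `stime` in clock ticks.

```python
import os
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so cut at the last ')' before splitting the rest on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int, seconds: float, ticks_per_second: int) -> float:
    """CPU usage over the interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def sample_pid(pid: int, seconds: float = 60.0) -> float:
    """Sample one process twice (Linux only) and return its average CPU percentage."""
    path = f"/proc/{pid}/stat"
    with open(path) as stat_file:
        before = cpu_ticks_from_stat(stat_file.read())
    time.sleep(seconds)
    with open(path) as stat_file:
        after = cpu_ticks_from_stat(stat_file.read())
    return cpu_percent(before, after, seconds, os.sysconf("SC_CLK_TCK"))
```

The agent may run as more than one process, in which case the PIDs listed by `pstree -p` would each need to be sampled and summed. On a 1-vCPU size such as B1ls, the result is directly comparable to the 5% baseline.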
I'm also open to a screen-sharing session if that would help.
@gregolsky We have identified the line of code that is taking a bit of time and causing the increase in CPU usage. While we work on a fix, we have removed the version from the artifacts repository so that new VMs won't get it. In your case, to roll back to 2.9.1.1, please run the following commands (comments are in brackets). Let me know if you need help; I can shadow you.
systemctl stop walinuxagent (stop the agent service)
cd /var/lib/waagent/ (go to the agent package folder)
ls (you will see the 2.10.0.3 package as a folder and a zip: WALinuxAgent-2.10.0.3 and WALinuxAgent-2.10.0.3.zip)
rm -rf WALinuxAgent-2.10.* (remove the 2.10 packages from the VM)
ls (make sure the 2.10 packages were deleted)
systemctl restart walinuxagent (restart the agent so that it picks up 2.9.1.1 as the latest version)
waagent --version (check the version in the output; it should contain something like "Goal state agent: 2.9.1.1")
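When the rollback has to be repeated on several VMs, the final check can be scripted. The sketch below is a hypothetical helper, not part of the agent. It assumes only what the last step states, namely that the output of `waagent --version` contains a line like "Goal state agent: 2.9.1.1". The function names are made up for illustration.

```python
import re
from typing import Optional


def goal_state_agent_version(version_output: str) -> Optional[str]:
    """Extract the version after 'Goal state agent:' or return None if it is absent."""
    match = re.search(r"Goal state agent:\s*([0-9]+(?:\.[0-9]+)*)", version_output)
    return match.group(1) if match else None


def rolled_back(version_output: str, expected: str = "2.9.1.1") -> bool:
    """True when the reported goal state agent version equals the expected one."""
    return goal_state_agent_version(version_output) == expected
```

The text passed in would be the captured output of `waagent --version`, for example via `subprocess.run(["waagent", "--version"], capture_output=True, text=True).stdout`.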
That's great, I'm going to try it out today and get back to you.
We applied the downgrade instructions and they worked. Thank you.
Log file attached: I am wary of sharing the log publicly. Please let me know where I can upload it privately and securely.
Last few days of logs is: