Azure / Guest-Configuration-Extension

Azure Guest Configuration Virtual Machine Extension for Linux
Apache License 2.0

Excessive memory consumption #131

Open Confusedfish opened 2 years ago

Confusedfish commented 2 years ago

I have a number of servers that, following an update to the guest configuration assignments (Azure activity log: Fri 13th Aug 2021, 19:50), are all showing signs of a memory leak in the GCD service. One of these servers has become unresponsive twice now after running out of available memory and had to be manually deallocated and restarted in Azure to resume production activities.

Manually stopping the GCD service returns the consumed memory and allows the server to resume normal memory consumption.

image

Prior to the recent change the servers had steady memory usage.

image

Is there any reason for this service to be consuming over 1 GB of memory?

image

Can you offer any insight into how to diagnose this issue?
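
One low-tech way to confirm the growth pattern while waiting for an answer is to sample the service's resident memory over time. The sketch below is only an illustration, not something from the extension's documentation: it reads the `VmRSS` field from `/proc/<pid>/status` on Linux, and the helper names (`parse_vmrss_kib`, `read_rss_kib`, `sample_rss`) are hypothetical.

```python
import re
import time
from pathlib import Path
from typing import List, Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (reported by the kernel in kB) from the
    text of /proc/<pid>/status. Returns None if the field is absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a running process (Linux only)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())


def sample_rss(pid: int, samples: int = 5, interval_s: float = 60.0) -> List[Optional[int]]:
    """Take several RSS readings spaced interval_s seconds apart."""
    readings = []
    for i in range(samples):
        readings.append(read_rss_kib(pid))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

Calling `sample_rss(pid)` with the pid of the GCD service process should show whether the resident size climbs steadily between readings, which would be consistent with a leak rather than a one-off spike.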

Confusedfish commented 2 years ago

As an update to the above, the other servers that are suffering from this issue have more free memory. These servers never reach the out-of-memory condition, as the GCD service seems to restart of its own accord once it has allocated 1.3 GB.

image

Checking the journal on this server reveals the following each time the memory is released, which might be helpful:

sudo journalctl -u gcd.service

Aug 23 10:30:08 vm-nat bash[27784]: Finished waiting on pid=27785
Aug 23 10:30:18 vm-nat systemd[1]: gcd.service: Service hold-off time over, scheduling restart.
Aug 23 10:30:18 vm-nat systemd[1]: gcd.service: Scheduled restart job, restart counter is at 291.
Aug 23 10:30:18 vm-nat systemd[1]: Stopped GC Service.
Aug 23 10:30:18 vm-nat systemd[1]: Started GC Service.
Aug 23 10:30:18 vm-nat bash[11892]: Start service process
Aug 23 10:30:18 vm-nat bash[11892]: service pid = 11893
Aug 23 10:30:18 vm-nat bash[11892]: Waiting for service pid=11893
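
To see how often the service is being restarted, the "restart counter" lines in this output can be pulled out and lined up by timestamp. The following is a minimal sketch that assumes the journal text has the same shape as the excerpt above; `restart_events` is a hypothetical helper name, and the unit name `gcd.service` is taken from the log.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# "Aug 23 10:30:18 vm-nat systemd[1]: gcd.service: Scheduled restart job, restart counter is at 291."
RESTART_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[\d+\]: "
    r"gcd\.service: Scheduled restart job, restart counter is at (\d+)\.",
    re.MULTILINE,
)


def restart_events(journal_text: str) -> List[Tuple[str, int]]:
    """Return (timestamp, restart_counter) pairs found in journalctl output."""
    return [(ts, int(count)) for ts, count in RESTART_RE.findall(journal_text)]
```

Feeding it the saved output of `sudo journalctl -u gcd.service` gives a list of restart times, which can then be compared against the memory graph to check that each release of memory coincides with a restart.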

mikecowie-seequent commented 2 years ago

Hi @Confusedfish, we encountered this issue too and have a support case open. I will direct the agent towards this issue.

The indication so far is that the security policy “AzureLinuxBaseline” has many conditions to audit. It is definitely a bug, as a supporting agent shouldn't (in my view) under any circumstances allocate such a high percentage of system memory. I would rather have it stop and raise a minor error to the portal.

Confusedfish commented 2 years ago

Thanks for raising the support request. I thought I would try removing the extension from our production server while this was ongoing but now can't figure out how to reinstall it unfortunately.

Did your issue start around the 13th too @mikecowie-seequent ?

mikecowie-seequent commented 2 years ago

@Confusedfish, yes, it started on the 13th at 17:00 UTC. We mitigated it by up-sizing the instances.

Confusedfish commented 2 years ago

We use reserved instance sizes so didn't think that was an option. Thanks for confirming it's not just us. Would you please share the outcome of the support request when you hear back?

Nice little money spinner, that. Force all your clients to up the size of their VMs.

mikecowie-seequent commented 2 years ago

@Confusedfish, cynical, cynical ;). I asked them to respond here.

mikecowie-seequent commented 2 years ago

@Confusedfish, oh, another interesting element is that we had automatic updates of the extension disabled, but you say that for you it corresponded to an update to the agent?

...I could spend too long on this rabbit hole, I'm going to drop off until there's a response from MS. :)

Confusedfish commented 2 years ago

I had spotted that too, but I guess there is the extension and then there is the configuration of that extension; perhaps the latter was changed and not the former. It certainly shows up in the activity log.

I just checked another subscription that has another Linux VM in a different region. It lacks the change on the 13th and doesn't have the issue. Is your problem VM in UK West?

Let's see what MS come back with.

mikecowie-seequent commented 2 years ago

@Confusedfish, we had it in 5 regions across 4 continents; it affected every region where we had B1s instances.

Confusedfish commented 2 years ago

Thanks for raising it with MS as we lack a support contract. Not sure why it should be necessary to pay to diagnose this sort of thing but then as you already identified I am cynical!

I have it affecting 2xB2s, 1xB1ms & 1xD4ds_v4 so not B1s specific. The Windows instance is unaffected. I have just needed to remove the extension from the B1s instance as that is only a 2GB machine. A process that starts swallowing 1.3GB doesn't take long to knock it offline (as you well know!)

Auto updates to the extension are also disabled for us, but this is the change I meant in the Activity Log, which on this machine was at 19:00 UTC, so a couple of hours after yours. Other servers have slightly different times for this, but all about the same:

image

mikecowie-seequent commented 2 years ago

I've (once again) requested that MS acknowledge this issue here and work publicly rather than through the private channel. Unfortunately that repeated request has not been taken up yet.

I have done some more back-and-forth with the MS support agent, but it's slow going (I have weird hours to begin with, and a support case for a mitigated issue is never going to be my #1 priority for the week...). InSpec (a Chef-related thing) looks to be the engine involved, and the number of tasks/policies it has to audit seems to be related to the issue. More fundamentally, I've expressed that a supporting agent shouldn't be taking a server offline because it allocated too much memory, whatever the underlying problem causing it to allocate too much memory, and pointed at https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html as (to me) obvious reading for how the immediate problem could be addressed.
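
For anyone wanting to try the resource-control route themselves, the sketch below renders a systemd drop-in that caps the unit's memory. It is a workaround sketch under assumptions, not a fix confirmed by Microsoft: `MemoryMax=` and `MemoryHigh=` are directives described on the page linked above, the `512M` and `384M` values are arbitrary examples, and `render_memory_dropin` is a hypothetical helper name. Those two directives need the unified (v2) cgroup hierarchy; on older cgroup v1 hosts the legacy `MemoryLimit=` directive is the one that applies.

```python
def render_memory_dropin(memory_max: str, memory_high: str = "") -> str:
    """Build the text of a systemd drop-in that limits a service's memory.

    memory_max  -- hard cap (e.g. "512M"); the kernel OOM-kills the unit above it.
    memory_high -- optional soft threshold at which the kernel throttles and
                   reclaims aggressively before the hard cap is reached.
    """
    lines = ["[Service]"]
    if memory_high:
        lines.append(f"MemoryHigh={memory_high}")
    lines.append(f"MemoryMax={memory_max}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only; pick limits that suit the VM size.
    print(render_memory_dropin("512M", memory_high="384M"), end="")
```

The output would go in a drop-in file such as `/etc/systemd/system/gcd.service.d/memory.conf` (the standard drop-in location for a unit named `gcd.service`), followed by `sudo systemctl daemon-reload` and `sudo systemctl restart gcd.service`. Whether the extension tolerates being capped this way, or rewrites its unit on update, is untested here.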

Kolky commented 4 months ago

We have been having similar issues for the past 2-3 weeks, as seen in htop. We raised it via a ticket to Microsoft. Did you, @mikecowie-seequent or @Confusedfish, resolve this?

Thomas-Verschuere2 commented 3 weeks ago

Was there a solution for this?