Azure / Guest-Configuration-Extension

Azure Guest Configuration Virtual Machine Extension for Linux
Apache License 2.0

On long-running VMs, guest assignments sometimes disappear from the Azure portal; gc worker shuts down and never resumes #164

Open eehret opened 2 months ago

eehret commented 2 months ago

I'm not 100% sure what causes this, but we've seen it in quite a few Azure virtual machines.

Something seems to make the gc worker conclude that there is nothing left to do, and it just stops. All of the existing guest assignments eventually disappear from the Azure portal because the gc worker is no longer reporting compliance and the guest assignments expire. Even the 'AzureLinuxBaseline' guest assignments disappear.

Here's what the tail end of the gc_worker.log looks like when this condition gets triggered:

[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GC_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Agent is waiting for running operations to finish. Current active operation count is 0.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GC_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] All the active operations are finished.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [CONSISTENCY_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Deleting consistency operation context.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [CONSISTENCY_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] Consistency operation context deleted successfully.
[2024-04-23 12:52:04.935] [PID 3254671] [TID 3254672] [GCCACHE_OPERATIONS] [INFO] [00000000-0000-0000-0000-000000000000] gc cache context deleted successfully.
[2024-04-23 12:52:05.235] [PID 3254671] [TID 3254671] [PSPROVIDER] [INFO] [00000000-0000-0000-0000-000000000000] Cleanup(). shutdownCoreClr() is successful
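
A minimal sketch of how this log signature could be checked for on other VMs. The function name `gc_worker_looks_stopped` is hypothetical, and the log path is passed in as an argument because the full path of gc_worker.log is not given here. It only tests whether the last line of the log is the `shutdownCoreClr() is successful` message shown above; the same line may also appear during a normal shutdown, so a match is a hint rather than proof of the hung state.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the last line of the
# given log file is the Cleanup() message seen at the end of the excerpt above.
# Usage: gc_worker_looks_stopped /path/to/gc_worker.log
gc_worker_looks_stopped() {
    logfile=$1
    # -F: match the text literally, -q: no output, only an exit status
    tail -n 1 "$logfile" | grep -qF 'shutdownCoreClr() is successful'
}
```

It could be called as `gc_worker_looks_stopped /path/to/gc_worker.log && echo "gc worker appears to have shut down"`, with the placeholder replaced by the real log location.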

When this happens, it is also impossible to restart gcd.service: the restart hangs, waiting forever for a process to exit.

# systemctl status gcd
● gcd.service - GC Service
     Loaded: loaded (/lib/systemd/system/gcd.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Tue 2024-05-07 20:02:33 UTC; 6s ago
   Main PID: 919448 (bash)
      Tasks: 43 (limit: 19179)
     Memory: 271.0M
     CGroup: /system.slice/gcd.service
             ├─919448 /bin/bash /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.26.64/GCAgent/GC/run_service.sh
             └─919449 /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.26.64/GCAgent/GC/gc_linux_service

May 07 20:02:33 vm-dtb-aid-sg2-pr-02 systemd[1]: Stopping GC Service...
May 07 20:02:33 vm-dtb-aid-sg2-pr-02 bash[919448]: Stopping service process 919449
May 07 20:02:33 vm-dtb-aid-sg2-pr-02 bash[919448]: Waiting for service pid=919449
Warning: journal has been rotated since unit was started, output may be incomplete.

The only way I've found so far to recover from this, short of a reboot (which we can't do at will in a production environment), is to manually kill the process that the service stop is waiting on.
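
The manual kill could be sketched roughly as follows. The helper name `force_stop_pid` and the escalation from SIGTERM to SIGKILL are assumptions; which signal was actually needed is not stated here. The PID has to be read from the `systemctl status gcd` output (919449 in the example above).

```shell
#!/bin/sh
# Hypothetical helper: ask a process to exit, then force it if it does not.
# $1 = PID to stop, $2 = grace period in seconds (default 10).
force_stop_pid() {
    pid=$1
    grace=${2:-10}
    # Polite request first; if it cannot be delivered, the process is
    # already gone (or not ours to signal) and there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 sends no signal, it only tests whether the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Still present after the grace period: SIGKILL cannot be caught or ignored
    kill -KILL "$pid" 2>/dev/null || true
}
```

For the status output above, the call would be `force_stop_pid 919449`, run as root.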

I'm at a loss as to how to troubleshoot this further. If Microsoft would like more information, or would like to work with me to gather it, I can make myself available.

In this specific instance, the OS was Ubuntu 20.04 LTS (Azure Marketplace image from Canonical).

I think I've found a workaround that might be feasible for us until this issue is properly resolved: creating an entry in /etc/cron.daily to restart gcd before it has a chance to reach the 'hung' state. So far that seems to have helped on the one VM where I've tried it.
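
A minimal sketch of what such a cron entry could look like. The actual script was not posted, so the function name `install_gcd_restart_job` and the file name `restart-gcd` are hypothetical. The file name deliberately contains no dot, because `run-parts` on Debian and Ubuntu skips files in /etc/cron.daily whose names contain one.

```shell
#!/bin/sh
# Hypothetical installer for the daily restart workaround.
# $1 = target directory (defaults to /etc/cron.daily).
install_gcd_restart_job() {
    dest_dir=${1:-/etc/cron.daily}
    job="$dest_dir/restart-gcd"
    # Quoted heredoc delimiter: the script body is written out verbatim
    cat > "$job" <<'EOF'
#!/bin/sh
# Restart the Guest Configuration service once a day (workaround, not a fix).
systemctl restart gcd.service
EOF
    # run-parts only executes files that are marked executable
    chmod 755 "$job"
}
```

Running `install_gcd_restart_job` as root writes /etc/cron.daily/restart-gcd. This only helps if the job runs before the service reaches the hung state, since a restart issued afterwards hangs as described above.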

eehret commented 1 month ago

I noticed that this also seems to happen when I delete guest assignments via the Azure portal, even if some guest assignments remain for the same virtual machine and the worker should, in theory, still have work to do.