Closed rwagnergit closed 1 year ago
@anhvoms could you take a look?
@rwagnergit RHEL 8.1+ and RHEL 7.7+ on Azure should be using cloud-init for provisioning and the resource disk formatting/partitioning should be handled by cloud-init. Cloud-init does a better job at discovering the resource disk by looking at the alias /dev/disk/cloud/resource instead.
Is waagent being deprecated? If not, you should fix it. I gift-wrapped the solution.
Rob
@rwagnergit walinuxagent provisioning is not deprecated. It is, however, considered to be in maintenance mode (we will only release patches for security related bugs)
Describe the bug:
On at least one of our Azure Linux VMs, waagent is incorrectly identifying the mount point of the ephemeral disk. In our particular case, the ephemeral disk is present at /dev/sda, the OS disk is present at /dev/sdac, and the boot partition is /dev/sdac1, mounted at /boot. When waagent starts, get_mount_point() in /usr/lib/python3.6/site-packages/azurelinuxagent/common/osutil/default.py returns /boot, which causes waagent to (among other things) try to create the swapfile under /boot, where it runs out of space (since /boot is only 500MB in size and we are attempting to create a 16GB swapfile). I dug into the code and I believe the problem is that the regex in get_mount_point() is not sufficiently specific. Instead of:
we should have:
After making that change, waagent functions correctly: get_mount_point() returns None, the ephemeral disk is properly partitioned and mounted, and the swapfile is created. I know there's a lot here, so I'll try to explain below. Note that this was discovered as part of Azure case #2206030040004173.
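To illustrate the kind of mismatch described above, here is a minimal sketch. It is not the agent's actual code: the function name `find_mount_point`, the `strict` flag, the sample mount lines, and both patterns are assumptions made up for illustration. It only shows how an unanchored search for /dev/sda can pick up a /dev/sdac1 entry, while a pattern that requires optional partition digits followed by whitespace does not.

```python
import re

# Hypothetical mount-table lines modeled on the layout described in this issue
# (OS disk at /dev/sdac, ephemeral disk at /dev/sda not yet mounted).
MOUNT_LIST = [
    "/dev/sdac1 on /boot type xfs (rw,relatime)",
    "/dev/sdac2 on / type xfs (rw,relatime)",
]


def find_mount_point(mount_list, device, strict):
    """Return the mount point of the first line matching `device`, else None.

    strict=False: the device name may match as a mere prefix of another name.
    strict=True: the name must start the line and be followed only by
    optional partition digits and then whitespace.
    """
    if strict:
        pattern = re.compile(r"^" + re.escape(device) + r"\d*\s")
    else:
        pattern = re.compile(re.escape(device))
    for line in mount_list:
        if pattern.search(line):
            tokens = line.split()
            # Lines look like "<device> on <mount point> type <fs> (<options>)".
            return tokens[2] if len(tokens) > 2 else None
    return None


loose = find_mount_point(MOUNT_LIST, "/dev/sda", strict=False)
tight = find_mount_point(MOUNT_LIST, "/dev/sda", strict=True)
print(loose, tight)
```

With the loose pattern the lookup for /dev/sda lands on the /boot entry that belongs to /dev/sdac1; with the stricter pattern no line matches and the result is None, which corresponds to the behavior described after the change.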
When the problem occurs, here's what we see in waagent.log:
And here is lsblk:
If we look at the relevant portion of azurelinuxagent/daemon/resourcedisk/default.py:
And the relevant portion of azurelinuxagent/common/osutil/default.py:
We can trace the problem:
After making the above change, restarting waagent yields a better waagent.log:
Distro and WALinuxAgent details (please complete the following information):
Additional context: I believe a necessary prerequisite for the problem to occur is that the Azure VM has more than 26 disks attached, so that the /dev/sd* device names roll over to /dev/sdaX (where X is a letter). Until that rollover occurs, a regex looking for /dev/sda cannot match any other device. In sum, the existing regex matches any device name with a PREFIX of /dev/sda, which can only produce a wrong match once more than 26 drives are attached.
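The rollover can be sketched as follows. This assumes the usual Linux sd naming sequence (sda through sdz, then sdaa, sdab, and so on); the helper `sd_device_name` is a hypothetical name for illustration, and the order in which a given VM actually assigns names to its disks is not something this sketch models.

```python
def sd_device_name(index):
    """Return the sd-style name for a zero-based disk index (0 -> /dev/sda)."""
    suffix = ""
    n = index + 1
    # Letters behave like a base-26 numbering without a zero digit.
    while n > 0:
        n, rem = divmod(n - 1, 26)
        suffix = chr(ord("a") + rem) + suffix
    return "/dev/sd" + suffix


def prefix_collisions(disk_count, device="/dev/sda"):
    """List other device names that start with `device` among the first N disks."""
    names = [sd_device_name(i) for i in range(disk_count)]
    return [name for name in names if name != device and name.startswith(device)]


print(prefix_collisions(26))
print(prefix_collisions(29))
```

With 26 disks no other name starts with /dev/sda. Once a 27th disk exists, names such as /dev/sdaa appear, and by the 29th the list includes /dev/sdac, the OS disk name seen in this report.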
Log file attached: I provided the relevant portions of the log above. That said, if having the entire log is helpful, I'm happy to provide it.