canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

fstab entries written by cloud-config may not be mounted #2892

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1691489

Launchpad details
affected_projects = ['cloud-init (Ubuntu)', 'cloud-init (Ubuntu Xenial)', 'cloud-init (Ubuntu Yakkety)', 'cloud-init (Ubuntu Zesty)', 'cloud-init (Ubuntu Artful)']
assignee = None
assignee_name = None
date_closed = 2021-01-04T14:52:27.855191+00:00
date_created = 2017-05-17T14:21:07.391494+00:00
date_fix_committed = 2021-01-04T14:52:27.855191+00:00
date_fix_released = 2021-01-04T14:52:27.855191+00:00
id = 1691489
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1691489
milestone = None
owner = smoser
owner_name = Scott Moser
private = False
status = fix_released
submitter = smoser
submitter_name = Scott Moser
tags = ['verification-done-xenial', 'verification-done-zesty']
duplicates = []

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:21:07.391494+00:00

=== Begin SRU Template ===

[Impact] There is a race condition on a re-deployment of cloud-init on Azure where /mnt will not get properly formatted or mounted. This is due to "dirty" entries in /etc/fstab that cause a device to be busy when cloud-init goes to format it. This usually shows itself as 'mkfs' complaining that the device is busy. The cause is that systemd starts an fsck and collides with cloud-init re-formatting the disk.

The problem can be seen in other places, but it was originally found on Azure and seemed to be most reproducible there.

[Test Case]
  1.) Launch an Azure VM, ideally size L32S.
  2.) Log in and verify the system properly mounted /mnt.
  3.) Re-deploy the VM through the web UI and try again.

[Regression Potential] Worst case scenario, these changes unnecessarily slow down boot and do not fix the problem.

[Regression] This SRU change caused bug 1717477.

[Other Info] Upstream commit at https://git.launchpad.net/cloud-init/commit/?id=1f5489c258

=== End SRU Template ===

As reported in bug 1686514, sometimes /mnt will not get mounted when re-deploying or stopping-then-starting an Azure VM of size L32S. This is probably a more generic issue; I suspect it shows up here because of the speed of the disks on these systems.

Related bugs:
  * bug 1686514: Azure: cloud-init does not handle reformatting GPT partition ephemeral disks

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:35:56.773714+00:00

These tarballs are collected with 'save-old-data' at https://git.launchpad.net/~smoser/cloud-init/+git/sru-info/tree/bin

They represent:
  * orig-boot.tar.xz: the first boot of a 16.04 pristine image (0.7.9-90-g61eb03fe-0ubuntu1~16.04.1)
  * upgrade-first-reboot.tar.xz: I did a dpkg -i of cloud-init_0.7.9-139-gb5722bd1-1~bddeb_all.deb (current branch with fix for bug 1686514)
  * after-restart.tar.xz: After a 'stop' and then 'start' in the web console. This showed the bug.
  * after-restart-with-fsck.tar.xz: dpkg -i of another branch, cloud-init_0.7.9-140-g2e21a411-1~bddeb_all.deb, and stop and start.

Launchpad attachments: orig-boot.tar.xz

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:36:31.886469+00:00

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:37:18.877151+00:00

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:42:54.358658+00:00

Launchpad attachments: upgrade-first-reboot.tar.xz

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:43:07.246888+00:00

Launchpad attachments: after-restart.tar.xz

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:43:18.861899+00:00

Launchpad attachments: after-restart-with-fsck.tar.xz

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-19T18:36:23.749601+00:00

It seems that in addition to blocking fsck, we should also block swap usage.

The severity of this issue is somewhat limited as the scenario will only happen when
  a.) there is a filesystem (or swap) on a disk
  b.) there is a (likely stale) entry in /etc/fstab for that disk already

This means that we're kind of limited to either

  1. azure instances and resize/redeploy
  2. first boot of an instance snapshotted with stuff in /etc/fstab
  3. developer testing (re-partition/setup and rm -Rf /var/lib/cloud && reboot)
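
The two conditions above can be checked mechanically. What follows is a minimal sketch, not cloud-init's actual code: the function name `entries_at_risk`, the sample fstab contents, and the simple prefix match on the device path are all assumptions made for illustration. It flags /etc/fstab entries on a given disk that systemd would either activate as swap or fsck before mounting (non-zero sixth column).

```python
from typing import List, Tuple

# Illustrative contents only; not taken from the bug report.
SAMPLE_FSTAB = """\
# <spec> <mountpoint> <type> <options> <dump> <pass>
LABEL=cloudimg-rootfs / ext4 defaults 0 1
/dev/sdb1 /mnt auto defaults,nofail,comment=cloudconfig 0 2
/dev/sdb2 none swap sw,comment=cloudconfig 0 0
/dev/sdc1 /data ext4 defaults 0 0
"""


def entries_at_risk(fstab_text: str, disk: str) -> List[Tuple[str, str]]:
    """Return (spec, reason) pairs for fstab entries on `disk` that
    systemd may touch (swap activation or fsck) early in boot."""
    at_risk = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a usable entry
        spec, fstype = fields[0], fields[2]
        # The sixth field (fsck pass number) defaults to 0 when absent.
        passno = fields[5] if len(fields) > 5 else "0"
        # Assumption: a plain prefix match is enough to tie a
        # partition to its disk in this sketch.
        if not spec.startswith(disk):
            continue
        if fstype == "swap":
            at_risk.append((spec, "swap"))
        elif passno != "0":
            at_risk.append((spec, "fsck"))
    return at_risk


print(entries_at_risk(SAMPLE_FSTAB, "/dev/sdb"))
```

Entries that name the device by LABEL= or UUID= are not resolved here; a real check would have to map those back to the device first.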

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-19T18:38:11.262026+00:00

Dimitri,

Do you know how I can limit swap usage until after cloud-init.service is done? I'm under the impression that I can do that with fsck by adding the drop-in to /systemd/systemd-fsck@.service.d/cloud-init.conf as seen in the merge proposal.

I'm open to other ideas too.

ubuntu-server-builder commented 1 year ago

Launchpad user Balint Reczey(rbalint) wrote on 2017-06-19T16:32:28.975829+00:00

I tried finding other options, but to work around /etc/fstab containing a potentially invalid swap partition, the only option seems to be calling "swapoff -a" and then later "swapon -a" from cloud-init when it detects that a partition re-initialization needs to take place.

The same stands for systemd-fsckd.service. IMO it should be stopped while reformatting takes place instead of adding the drop-in, which would potentially slow down boot even when this workaround is not needed.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-06-21T18:22:59.008357+00:00

Balint,

Thanks for the reply.

With regard to slowing down boot, I'm not too concerned about that. Because in almost all properly functioning scenarios, cloud-init's generator will enable or disable cloud-init. So the slow down would be limited to scenarios where cloud-init was supposed to run, primarily on non-first boots of an instance. I agree though, it does put a bottleneck in boot.

With regard to 'swapoff -a' or 'swapon -a' or the systemd-fsck.service equivalent, I'm not opposed to that, but I don't know how it could be made non-racy. Do you have a solution in mind that doesn't have a race in it?

Ie, for swap:

while systemd in parallel

This can be mitigated some by being more granular (swapoff /dev/XXX), but still racy unless cloud-init can coordinate that with systemd. Is that possible?

Thanks again for the input. Scott

ubuntu-server-builder commented 1 year ago

Launchpad user Balint Reczey(rbalint) wrote on 2017-06-23T22:48:26.408976+00:00

I filed a merge request to limit the fsck delay to Azure, please take a look at it.

Regarding the swap, I think the least hack-ish safe solution would be relying on systemd-fstab-generator to create the .swap units as usual, and instead of running swapoff/swapon, cloud-init could find all .swap units and stop them while it does its work.

That would avoid the race because the generator runs early, before the units, and stopping .swap units is done by systemd.
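
As a rough illustration of the "find all .swap units and stop them" idea, here is a hedged sketch; it is not what was merged. It assumes the unit list comes from `systemctl list-units --type=swap --all --plain --no-legend`, whose first column is the unit name. The helper names `parse_swap_units` and `stop_commands` and the sample output are hypothetical. The sketch only builds the commands and does not run them.

```python
from typing import List

# Illustrative output only, shaped like
# `systemctl list-units --type=swap --all --plain --no-legend`.
SAMPLE_OUTPUT = """\
dev-sdb2.swap loaded active active /dev/sdb2

swapfile.swap loaded inactive dead /swapfile
"""


def parse_swap_units(list_units_output: str) -> List[str]:
    """Extract .swap unit names from the first column of each line."""
    units = []
    for line in list_units_output.splitlines():
        fields = line.split()
        if fields and fields[0].endswith(".swap"):
            units.append(fields[0])
    return units


def stop_commands(units: List[str]) -> List[List[str]]:
    """Build (but do not execute) one `systemctl stop` per unit."""
    return [["systemctl", "stop", unit] for unit in units]


for cmd in stop_commands(parse_swap_units(SAMPLE_OUTPUT)):
    print(" ".join(cmd))
```

Letting systemd stop the units, rather than calling swapoff directly, keeps systemd's view of the swap state consistent with what actually happened.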

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-07-31T14:37:05.899465+00:00

This bug was fixed in the package cloud-init - 0.7.9-231-g80bf98b9-0ubuntu1


cloud-init (0.7.9-231-g80bf98b9-0ubuntu1) artful; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user Chris J Arges(arges) wrote on 2017-08-23T12:27:59.297447+00:00

Hello Scott, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Chris J Arges(arges) wrote on 2017-08-23T12:31:23.456563+00:00

Hello Scott, or anyone else affected,

Accepted cloud-init into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~17.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2017-09-12T22:02:42.384457+00:00

Validated across multiple (5) 'clean' reboots that Azure VMs don't hit the race condition with mounts and don't result in cloud-init errors.

ubuntu@xen1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.1
ubuntu@xen1:~$ grep -i error /var/log/cloud-init.log /run/cloud-init/*
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@xen1:~$ cat /run/cloud-init/result.json
{ "v1": { "datasource": "DataSourceAzure [seed=/var/lib/waagent]", "errors": [] } }
ubuntu@xen1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 21:58:12,526 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2017-09-12T22:20:55.710927+00:00

Zesty verification:

Saw initial failure before upgrade:

ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-153-g16a7302f-0ubuntu1~17.04.2
ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:08:16,313 - DataSourceAzure.py[DEBUG]: reformattable=True: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was ntfs formatted and had no important files. Safe for reformatting.

Saw 5 successes across reprovisions after upgrade:

ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:19:39,881 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted
ubuntu@zesty1:~$ mount | grep mnt
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~17.04.1
ubuntu@zesty1:~$ grep -i error /var/log/cloud-init /run/cloud-init/
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@zesty1:~$ cat /run/cloud-init/result.json
{ "v1": { "datasource": "DataSourceAzure [seed=/var/lib/waagent]", "errors": [] } }

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-09-13T01:26:05.837714+00:00

This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~16.04.1


cloud-init (0.7.9-233-ge586fe35-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Halse Rogers(raof) wrote on 2017-09-13T01:26:37.254138+00:00

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-09-13T01:27:27.937540+00:00

This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~17.04.1


cloud-init (0.7.9-233-ge586fe35-0ubuntu1~17.04.1) zesty; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user thermoman(thermoman) wrote on 2017-09-15T09:57:01.003524+00:00

This release broke a lot of my machines, generating ordering cycles on every machine.

Please see #1717477

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-09-15T18:36:54.000612+00:00

Not sure what to do here. We intend to fix the other bug (bug 1717477) by reverting this change. Thus re-opening this bug.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2017-09-15T21:34:26.400732+00:00

As far as I can tell, we can't "delay" the fsck service because of how systemd-fstab-generator handles /etc/fstab entries.

For entries with a non-zero value for fsck (6th column), the generator will write out a .mount file that looks like this:

ubuntu@ubuntu:/run/systemd/generator$ cat btrfs.mount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target
Requires=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service
After=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service

[Mount]
What=/dev/disk/by-uuid/d8e33db0-9a54-11e7-bd8f-525400123456
Where=/btrfs
Type=btrfs

This will want to run fsck on the device, and then mount it, and all before local-fs.target
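
The systemd-fsck@ instance name in the unit above is the device path after systemd's path escaping ("/" becomes "-", and most other non-alphanumeric bytes become \xNN). The sketch below is an approximation of what `systemd-escape --path` does, written for illustration; the helper names `systemd_escape_path` and `fsck_unit_for` are hypothetical. It reproduces the name shown in the generated .mount file.

```python
def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for an absolute path."""
    # Drop leading/trailing slashes and collapse repeated ones.
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-"  # the root directory escapes to a single dash
    trimmed = "/".join(parts)
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            # Everything else, including a literal "-", is hex-escaped.
            out.append("\\x%02x" % byte)
    return "".join(out)


def fsck_unit_for(device: str) -> str:
    """Name of the systemd-fsck@ instance that checks `device`."""
    return "systemd-fsck@%s.service" % systemd_escape_path(device)


print(fsck_unit_for("/dev/disk/by-uuid/d8e33db0-9a54-11e7-bd8f-525400123456"))
```

Knowing the instance name is what would let a tool find, order against, or stop the fsck for one specific device rather than every fsck on the system.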

cloud-init cannot run until after local-fs.target is reached. Asking the fsck service to run later is always going to be in conflict with the fsck+mount ordering from the generator.

I'm not sure we can reliably interrupt these services; the .mount unit is going to require a fsck; if we stop the fsck, then the mount won't happen.

This is going to require some more thought and discussion.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-09-23T02:32:44.341069+00:00

This bug is believed to be fixed in cloud-init in 17.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.