Closed ubuntu-server-builder closed 1 year ago
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:35:56.773714+00:00
These tarballs are collected with 'save-old-data' at https://git.launchpad.net/~smoser/cloud-init/+git/sru-info/tree/bin
They represent:
* orig-boot.tar.xz: the first boot of a 16.04 pristine image (0.7.9-90-g61eb03fe-0ubuntu1~16.04.1)
* upgrade-first-reboot.tar.xz: I did a dpkg -i of cloud-init_0.7.9-139-gb5722bd1-1~bddeb_all.deb (current branch with fix for bug 1686514)
* after-restart.tar.xz: After a 'stop' and then 'start' in the web console. This showed the bug.
* after-restart-with-fsck.tar.xz: dpkg -i of another branch, cloud-init_0.7.9-140-g2e21a411-1~bddeb_all.deb, and stop and start.
Launchpad attachments: orig-boot.tar.xz
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:36:31.886469+00:00
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:37:18.877151+00:00
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:42:54.358658+00:00
Launchpad attachments: upgrade-first-reboot.tar.xz
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:43:07.246888+00:00
Launchpad attachments: after-restart.tar.xz
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:43:18.861899+00:00
Launchpad attachments: after-restart-with-fsck.tar.xz
Launchpad user Scott Moser(smoser) wrote on 2017-05-19T18:36:23.749601+00:00
It seems that in addition to blocking fsck, we should also block swap usage. The severity of this issue is somewhat limited, as the scenario will only happen when:
a.) there is a filesystem (or swap) on a disk
b.) there is a (likely stale) entry in /etc/fstab for that disk already
This means that we're kind of limited to either
Launchpad user Scott Moser(smoser) wrote on 2017-05-19T18:38:11.262026+00:00
Dimitri,
Do you know how I can limit swap usage until after cloud-init.service is done? I'm under the impression that I can do that with fsck by adding the drop-in to /systemd/systemd-fsck@.service.d/cloud-init.conf as seen in the merge proposal.
I'm open to other ideas too.
Launchpad user Balint Reczey(rbalint) wrote on 2017-06-19T16:32:28.975829+00:00
I tried finding other options, but to work around /etc/fstab potentially containing an invalid swap partition, the only option seems to be calling "swapoff -a" and then later "swapon -a" from cloud-init when it detects that a partition re-initialization needs to take place.
The same stands for systemd-fsckd.service. IMO it should be stopped for the time reformatting takes place instead of adding the drop-in which would potentially slow down boot even when this workaround is not needed.
Launchpad user Scott Moser(smoser) wrote on 2017-06-21T18:22:59.008357+00:00
Balint,
Thanks for the reply.
With regard to slowing down boot, I'm not too concerned about that. Because in almost all properly functioning scenarios, cloud-init's generator will enable or disable cloud-init. So the slow down would be limited to scenarios where cloud-init was supposed to run, primarily on non-first boots of an instance. I agree though, it does put a bottleneck in boot.
With regard to 'swapoff -a' or 'swapon -a' or the systemd-fsck.service equivalent, I'm not opposed to that, but I don't know how it could be made non-racy. Do you have a solution in mind that doesn't have a race in it?
Ie, for swap:
while systemd in parallel
This can be mitigated some by being more granular (swapoff /dev/XXX), but still racy unless cloud-init can coordinate that with systemd. Is that possible?
Thanks again for the input. Scott
Launchpad user Balint Reczey(rbalint) wrote on 2017-06-23T22:48:26.408976+00:00
I filed a merge request to limit the fsck delay to Azure, please take a look at it.
Regarding the swap I think the least hack-ish safe solution would be relying on systemd-fstab-generator to create the .swap units as usual, and instead of running swapoff/swapon cloud init could find all .swap units and stop them for the time it does things.
That would avoid the race because the generator runs early, before the units, and stopping .swap units is done by systemd.
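The .swap unit approach described above could look roughly like the following. This is only a sketch, not what cloud-init ships: the function names (parse_swap_units, list_active_swap_units, stop_units, start_units) are hypothetical, and it assumes that `systemctl list-units --type=swap --no-legend --plain` prints one unit per line with the unit name in the first column.

```python
import subprocess
from typing import Callable, Iterable, List


def parse_swap_units(listing: str) -> List[str]:
    """Return the names of .swap units found in `systemctl list-units` output."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        # The unit name is assumed to be the first whitespace-separated column.
        if fields and fields[0].endswith(".swap"):
            units.append(fields[0])
    return units


def list_active_swap_units() -> List[str]:
    """Ask systemd for the swap units it currently has loaded and active."""
    result = subprocess.run(
        ["systemctl", "list-units", "--type=swap", "--no-legend", "--plain"],
        check=True, capture_output=True, text=True,
    )
    return parse_swap_units(result.stdout)


def stop_units(units: Iterable[str], runner: Callable = subprocess.run) -> None:
    """Stop each unit through systemd rather than calling swapoff directly."""
    for unit in units:
        runner(["systemctl", "stop", unit], check=True)


def start_units(units: Iterable[str], runner: Callable = subprocess.run) -> None:
    """Start the same units again once the disk work is finished."""
    for unit in units:
        runner(["systemctl", "start", unit], check=True)
```

A caller would record the list from list_active_swap_units(), call stop_units() before touching the disk, and call start_units() with the same list afterwards. Whether systemd would still try to activate a swap unit in between is exactly the coordination question raised earlier in this thread, so this does not by itself prove the race is gone.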
Launchpad user Launchpad Janitor(janitor) wrote on 2017-07-31T14:37:05.899465+00:00
This bug was fixed in the package cloud-init - 0.7.9-231-g80bf98b9-0ubuntu1
cloud-init (0.7.9-231-g80bf98b9-0ubuntu1) artful; urgency=medium
New upstream snapshot.
-- Scott Moser smoser@ubuntu.com Mon, 31 Jul 2017 09:47:34 -0400
Launchpad user Chris J Arges(arges) wrote on 2017-08-23T12:27:59.297447+00:00
Hello Scott, or anyone else affected,
Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.
Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.
Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
Launchpad user Chris J Arges(arges) wrote on 2017-08-23T12:31:23.456563+00:00
Hello Scott, or anyone else affected,
Accepted cloud-init into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~17.04.1 in a few hours, and then in the -proposed repository.
Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.
Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
Launchpad user Chad Smith(chad.smith) wrote on 2017-09-12T22:02:42.384457+00:00
Validated across multiple (5) 'clean' reboots that Azure vms don't hit the race condition with mounts and don't result in cloud-init errors.
ubuntu@xen1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.1
ubuntu@xen1:~$ grep -i error /var/log/cloud-init.log /run/cloud-init/*
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@xen1:~$ cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceAzure [seed=/var/lib/waagent]",
  "errors": []
 }
}
ubuntu@xen1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 21:58:12,526 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted
Launchpad user Chad Smith(chad.smith) wrote on 2017-09-12T22:20:55.710927+00:00
Zesty verification:
ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-153-g16a7302f-0ubuntu1~17.04.2
ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:08:16,313 - DataSourceAzure.py[DEBUG]: reformattable=True: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was ntfs formatted and had no important files. Safe for reformatting.

ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:19:39,881 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted
ubuntu@zesty1:~$ mount | grep mnt
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~17.04.1
ubuntu@zesty1:~$ grep -i error /var/log/cloud-init /run/cloud-init/
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@zesty1:~$ cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceAzure [seed=/var/lib/waagent]",
  "errors": []
 }
}
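For anyone repeating this verification over several reboots, the result.json check could be scripted instead of grepped by hand. The following is a minimal sketch that assumes the file has the shape shown in the output above (a "v1" object with an "errors" list); the helper names cloud_init_errors and read_result_errors are hypothetical.

```python
import json
from typing import List

# Path taken from the verification output in this bug.
RESULT_JSON = "/run/cloud-init/result.json"


def cloud_init_errors(result_json_text: str) -> List[str]:
    """Return the errors list from the text of a result.json file."""
    data = json.loads(result_json_text)
    # A missing "v1" or "errors" key is treated as "no errors recorded".
    return list(data.get("v1", {}).get("errors", []))


def read_result_errors(path: str = RESULT_JSON) -> List[str]:
    """Read result.json from disk and return its errors list."""
    with open(path, "r", encoding="utf-8") as handle:
        return cloud_init_errors(handle.read())
```

An empty list from read_result_errors() after each reboot corresponds to the "errors": [] lines in the transcripts above. It does not replace checking that /mnt is actually mounted.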
Launchpad user Launchpad Janitor(janitor) wrote on 2017-09-13T01:26:05.837714+00:00
This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~16.04.1
cloud-init (0.7.9-233-ge586fe35-0ubuntu1~16.04.1) xenial-proposed; urgency=medium
New upstream snapshot.
-- Scott Moser smoser@ubuntu.com Mon, 31 Jul 2017 16:36:16 -0400
Launchpad user Chris Halse Rogers(raof) wrote on 2017-09-13T01:26:37.254138+00:00
The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
Launchpad user Launchpad Janitor(janitor) wrote on 2017-09-13T01:27:27.937540+00:00
This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~17.04.1
cloud-init (0.7.9-233-ge586fe35-0ubuntu1~17.04.1) zesty; urgency=medium
New upstream snapshot.
-- Scott Moser smoser@ubuntu.com Mon, 31 Jul 2017 16:33:24 -0400
Launchpad user thermoman(thermoman) wrote on 2017-09-15T09:57:01.003524+00:00
This release broke a lot of my machines, generating ordering cycles on every machine.
Please see #1717477
Launchpad user Scott Moser(smoser) wrote on 2017-09-15T18:36:54.000612+00:00
Not sure what to do here. We intend to fix the other bug (bug 1717477) by reverting this change. Thus re-opening this bug.
Launchpad user Ryan Harper(raharper) wrote on 2017-09-15T21:34:26.400732+00:00
As far as I can tell, I don't think we can "delay" the fsck service due to how the systemd-fstab-generator works on /etc/fstab entries
For entries with a non-zero value for fsck (6th column), the generator will write out a .mount file that looks like this:
ubuntu@ubuntu:/run/systemd/generator$ cat btrfs.mount
[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target
Requires=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service
After=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service

[Mount]
What=/dev/disk/by-uuid/d8e33db0-9a54-11e7-bd8f-525400123456
Where=/btrfs
Type=btrfs
This will want to run fsck on the device, and then mount it, and all before local-fs.target
cloud-init cannot run until after local-fs.target is reached. Asking the fsck service to run later is always going to be in conflict with the fsck+mount ordering from the generator.
I'm not sure we can reliably interrupt these services; the .mount unit is going to require a fsck; if we stop the fsck, then the mount won't happen.
This is going to require some more thought and discussion.
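To see which /etc/fstab entries end up with a systemd-fsck@ dependency like the one above, something along these lines could be used. It is a rough approximation of the generator's behaviour, not its actual logic: escape_path is a simplified stand-in for `systemd-escape --path`, fsck_units_from_fstab is a hypothetical helper, and only UUID= and /dev/ sources are handled.

```python
from typing import Dict


def escape_path(path: str) -> str:
    """Simplified version of systemd's path escaping for unit names."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-"
    out = []
    for index, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif ch.isascii() and (ch.isalnum() or ch in ":_"):
            out.append(ch)
        elif ch == "." and index > 0:
            out.append(ch)
        else:
            # Everything else, including "-", becomes \xNN per byte.
            out.extend("\\x%02x" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


def fsck_units_from_fstab(fstab_text: str) -> Dict[str, str]:
    """Map mount point -> fsck unit for entries with a non-zero 6th column."""
    units = {}
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            continue  # a missing fsck column is treated as 0 here
        what, where, fstype, passno = fields[0], fields[1], fields[2], fields[5]
        if fstype == "swap" or not passno.isdigit() or int(passno) == 0:
            continue
        if what.startswith("UUID="):
            what = "/dev/disk/by-uuid/" + what[len("UUID="):]
        if not what.startswith("/dev/"):
            continue  # other source forms (LABEL=, network) are not handled
        units[where] = "systemd-fsck@%s.service" % escape_path(what)
    return units
```

Fed an fstab line for the /btrfs example with a non-zero sixth column, this yields the same unit name that appears in the Requires= and After= lines of the generated btrfs.mount.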
Launchpad user Scott Moser(smoser) wrote on 2017-09-23T02:32:44.341069+00:00
This bug is believed to be fixed in cloud-init in 17.1. If this is still a problem for you, please make a comment and set the state back to New
Thank you.
This bug was originally filed in Launchpad as LP: #1691489
Launchpad details
Launchpad user Scott Moser(smoser) wrote on 2017-05-17T14:21:07.391494+00:00
=== Begin SRU Template ===
[Impact] There is a race condition on a re-deployment of cloud-init on Azure where /mnt will not get properly formatted or mounted. This is due to "dirty" entries in /etc/fstab that cause a device to be busy when cloud-init goes to format it. This shows itself usually as 'mkfs' complaining that the device is busy. The cause is that systemd starts an fsck and collides with cloud-init re-formatting the disk.
The problem can be seen in other places, but it was originally found on Azure and seemed to be most reproducible there.
[Test Case]
1.) Launch an Azure VM, ideally size L32S.
2.) Log in and verify the system properly mounted /mnt.
3.) Re-deploy the VM through the web UI and try again.
[Regression Potential] Worst case scenario, these changes unnecessarily slow down boot and do not fix the problem.
[Regression] This SRU change caused bug 1717477.
[Other Info] Upstream commit at https://git.launchpad.net/cloud-init/commit/?id=1f5489c258
=== End SRU Template ===
As reported in bug 1686514, sometimes /mnt will not get mounted when re-deploying or stopping-then-starting an Azure VM of size L32S. This is probably a more generic issue that I suspect shows up here due to the speed of the disks on these systems.
Related bugs: * bug 1686514: Azure: cloud-init does not handle reformatting GPT partition ephemeral disks