Closed ubuntu-server-builder closed 1 year ago
Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:30:56.010514+00:00
Launchpad attachments: cloud-init.tar
Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:34:21.370849+00:00
Launchpad attachments: Recreate script for monitoring mountpoints and rebooting
Launchpad user Ryan Harper(raharper) wrote on 2017-09-20T18:19:50.207080+00:00
The current analysis is that any parallel invocation of mount specifying the same device will result in one of the two mounts failing with EBUSY (or some other failure), due to how the underlying device handles opens or mounts in the kernel. This is recreatable outside of cloud-init by running mounts in parallel pointing at the same disk.
COUNT=0
while true; do
    umount -f ${MNT_POINT} ||:
    mount ${DISK} ${MNT_POINT} -t ext4 -o defaults || {
        echo "`date +%s.%N`: mount failed on COUNT=$COUNT"
        exit 1
    }
    echo "`date +%s.%N`: COUNT=$COUNT" > ${MNT_POINT}/lasttouch
    COUNT=$(($COUNT + 1))
done
COUNT=0
while true; do
    dd if=${DISK} bs=512 count=1 of=/dev/null
    mount -o ro,sync -t iso9660 ${DISK} ${MNT_POINT} ||:
    echo "`date +%s.%N`: READ COUNT=$COUNT"
    COUNT=$(($COUNT + 1))
done
There isn't currently a general solution to prevent this race from occurring, so the focus is on reducing the likelihood of the race, and on configuration changes that are resilient in the face of it.
The linked changes to cloud-init will reduce the number of block devices that OVF will probe. The OVF Datasource is looking for block devices with ISO9660 filesystems; on an instance that has no ISO9660 filesystems, OVF will not perform any mount operations.
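cloud-init's actual pre-check peeks at the device superblock; the idea can be sketched standalone. ISO9660 volumes carry the magic string "CD001" at byte offset 32769 (sector 16 of 2048-byte sectors, plus the 1-byte descriptor type), so a device can be filtered out without ever mounting it. This is an illustrative sketch, not cloud-init's code; a synthetic file stands in for a real block device:

```shell
# Detect ISO9660 by its "CD001" magic at offset 32769, avoiding a mount.
is_iso9660() {
    [ "$(dd if="$1" bs=1 skip=32769 count=5 2>/dev/null)" = "CD001" ]
}

# Demonstrate on a synthetic file (no real ISO image assumed here).
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=33 2>/dev/null
printf 'CD001' | dd of="$f" bs=1 seek=32769 conv=notrunc 2>/dev/null
if is_iso9660 "$f"; then RESULT="iso9660"; else RESULT="other"; fi
echo "$RESULT"
rm -f "$f"
```

A device that fails this cheap check never becomes a mount candidate, which is what removes the mount operations on instances with no ISO9660 filesystems.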
When crafting fstab entries, one can set the 6th column (fs_passno) to zero (0), and systemd will not attempt to fsck the device prior to mounting. This does not remove the mount race, but it does reduce the amount of time between when the mount unit starts and when it completes.
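For example, an fstab line with fs_passno (the 6th field) set to 0; device and mountpoint here are illustrative:

```
# <device>   <mountpoint>  <type>  <options>        <dump>  <pass>
/dev/xvdv    /media/v      ext4    defaults,nofail  0       0
```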
Launchpad user Ryan Harper(raharper) wrote on 2017-09-21T03:58:26.963034+00:00
Some more details; in particular, the ami-id is needed:
https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-1d4e7a66
A t2.micro is fine, in us-east-1; add 26 volumes, marked for deletion.
Run the script and start the reboot check.
That triggered in a single reboot. The instance image is old at this point; it comes with linux-image 4.4.0-aws-1030.
I'm currently testing with the image updated; the current linux image is 4.4.0-aws-1035. That's not reproducing as quickly; so far 25 reboots with no issues.
Separately, on a non-upgraded instance with the above ami and instance type as described, I've packaged and installed a cloud-init package from the linked branch; that's successfully run for 68 reboots with no issues.
I'll continue running these overnight and update this in the morning.
Launchpad user Ryan Harper(raharper) wrote on 2017-09-21T14:15:40.493129+00:00
Instance on 4.4.0-aws-1035 has 525 boots with no issues / no changes to cloud-init.
Instance on 4.4.0-aws-1030 has 650 boots with no issues / updated cloud-init.
I believe the cloud-init changes are still required; however, due to changes in the kernel/systemd or otherwise, systemd mounts are not interleaved with cloud-init probing. This behavior is very consistent within a specific kernel/systemd pair.
For example on the latest boot of 4.4.0-aws-1035, I can check the timestamps (first and last) of the mounts (ignoring xvda/rootfs)
$ journalctl --utc -o short-precise | grep mounted.filesystem | grep -v xvda | sed -e 1b -e '$!d'
Sep 21 13:56:13.858878 ip-172-31-27-63 kernel: EXT4-fs (xvdr): mounted filesystem with ordered data mode. Opts: (null)
Sep 21 13:56:14.129499 ip-172-31-27-63 kernel: EXT4-fs (xvdo): mounted filesystem with ordered data mode. Opts: (null)
And check the most recent cloud-init boot for when OVF search starts/stops
cloud-init-local starts OVF searching at: 2017-09-21 13:56:09,617 - handlers.py[DEBUG]: start: init-local/search-OVF: searching for local data from DataSourceOVF
Finishes at: 2017-09-21 13:56:10,322 - handlers.py[DEBUG]: finish: init-local/search-OVF: SUCCESS: no local data found from DataSourceOVF
So we probed before the systemd mounts started.
Then at cloud-init.service (net mode) time, start at 2017-09-21 13:56:11,650 - handlers.py[DEBUG]: start: init-network/search-OVFNet: searching for network data from DataSourceOVFNet
Finish OVFNet at: 2017-09-21 13:56:12,246 - handlers.py[DEBUG]: finish: init-network/search-OVFNet: SUCCESS: no network data found from DataSourceOVFNet
And OVFNet probing completes before the systemd mounts start at 13:56:13.85.
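The ordering check above can be scripted. A minimal sketch, with the two timestamps hard-coded from the excerpts (on a live system they would be extracted from /var/log/cloud-init.log and journalctl), that converts both to seconds-since-midnight and compares:

```shell
# When cloud-init finished OVF probing vs. when the first non-root
# systemd mount happened (values copied from the logs above).
probe_end="13:56:10.322"
first_mount="13:56:13.858878"

# HH:MM:SS.frac -> seconds since midnight (same UTC day assumed).
to_secs() {
    echo "$1" | awk -F: '{ printf "%.6f", $1*3600 + $2*60 + $3 }'
}

if awk -v p="$(to_secs "$probe_end")" -v m="$(to_secs "$first_mount")" \
       'BEGIN { exit !(p < m) }'; then
    RESULT="probe-before-mounts"
else
    RESULT="race"
fi
echo "$RESULT"
```

For these values the probe finishes roughly 3.5 seconds before the first mount, consistent with the logs above.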
Launchpad user Robin Miller(robincello) wrote on 2017-10-12T19:21:27.956807+00:00
This issue seems to have returned on kernel 4.4.0-1038-aws with cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.2
Launchpad user Ryan Harper(raharper) wrote on 2017-10-17T19:14:05.278745+00:00
Currently validating cloud-init in xenial-proposed, which resolves this specific issue of cloud-init's DataSourceOVF probing racing with systemd fstab mounts.
cloud-init | 17.1-18-gd4f70470-0ubuntu1~16.04.1 | xenial-proposed | source, all
Launchpad user Scott Moser(smoser) wrote on 2017-12-14T21:02:11.259951+00:00
This bug is believed to be fixed in cloud-init in 1705804. If this is still a problem for you, please make a comment and set the state back to New
Thank you.
This bug was originally filed in Launchpad as LP: #1718287
Launchpad details
Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:30:56.010514+00:00
[Issue]
After rebooting a 16.04 AWS instance (ami-1d4e7a66) with several external disks attached, formatted, and added to /etc/fstab - systemd mount targets fail to mount with:
● media-v.mount - /media/v
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2017-09-19 20:12:18 UTC; 1min 54s ago
    Where: /media/v
     What: /dev/xvdv
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 1196 ExecMount=/bin/mount /dev/xvdv /media/v -t ext4 -o defaults (code=exited, status=32)
Sep 19 20:12:17 ip-172-31-7-167 systemd[1]: Mounting /media/v...
Sep 19 20:12:17 ip-172-31-7-167 mount[1196]: mount: /dev/xvdv is already mounted or /media/v busy
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Mount process exited, code=exited status=32
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: Failed to mount /media/v.
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Unit entered failed state.
From the cloud-init logs, it appears that the OVF datasource is mounting the device to find data:
2017-09-19 20:12:17,502 - util.py[DEBUG]: Peeking at /dev/xvdv (max_bytes=512)
2017-09-19 20:12:17,502 - util.py[DEBUG]: Reading from /proc/mounts (quiet=False)
2017-09-19 20:12:17,502 - util.py[DEBUG]: Read 2570 bytes from /proc/mounts
...
2017-09-19 20:12:17,506 - util.py[DEBUG]: Running command ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid'] with allowed return codes [0] (shell=False, capture=True)
2017-09-19 20:12:17,545 - util.py[DEBUG]: Failed mount of '/dev/xvdv' as 'iso9660': Unexpected error while running command.
Command: ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid']
Exit code: 32
Reason: -
Stdout: -
Stderr: mount: wrong fs type, bad option, bad superblock on /dev/xvdv,
        missing codepage or helper program, or other error
2017-09-19 20:12:17,545 - util.py[DEBUG]: Recursively deleting /tmp/tmpw2tyqqid
2017-09-19 20:12:17,545 - DataSourceOVF.py[DEBUG]: /dev/xvdv not mountable as iso9660
[Vitals]
Version: 0.7.9-153-g16a7302f-0ubuntu1~16.04.2
OS: Ubuntu 16.04
Provider: AWS - ami-1d4e7a66
[Recreate]
To recreate this:
Launch an AWS instance using AMI ami-1d4e7a66 and attach several disks (I used 25 additional disks)
Format and mount all 25:
mkdir /media/{b..z}
for i in {b..z}; do
    mkfs -t ext4 /dev/xvd$i
    mount /dev/xvd$i /media/$i
    echo "/dev/xvd$i /media/$i ext4 defaults,nofail 0 2" >> /etc/fstab
done
Reboot the instance.
Since this is a race, multiple reboots may be necessary. A reproducer script is attached.