ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1718287

Launchpad details

affected_projects = []
assignee = raharper
assignee_name = Ryan Harper
date_closed = 2017-12-14T21:02:09.269716+00:00
date_created = 2017-09-19T20:30:56.010514+00:00
date_fix_committed = 2017-10-04T20:51:45.101550+00:00
date_fix_released = 2017-12-14T21:02:09.269716+00:00
id = 1718287
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1718287
milestone = None
owner = billy-olsen
owner_name = Billy Olsen
private = False
status = fix_released
submitter = billy-olsen
submitter_name = Billy Olsen
tags = ['sts']
duplicates = []

Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:30:56.010514+00:00

[Issue]

After rebooting a 16.04 AWS instance (ami-1d4e7a66) with several external disks attached, formatted, and added to /etc/fstab - systemd mount targets fail to mount with:

● media-v.mount - /media/v Loaded: loaded (/etc/fstab; bad; vendor preset: enabled) Active: failed (Result: exit-code) since Tue 2017-09-19 20:12:18 UTC; 1min 54s ago Where: /media/v What: /dev/xvdv Docs: man:fstab(5) man:systemd-fstab-generator(8) Process: 1196 ExecMount=/bin/mount /dev/xvdv /media/v -t ext4 -o defaults (code=exited, status=32)

Sep 19 20:12:17 ip-172-31-7-167 systemd[1]: Mounting /media/v... Sep 19 20:12:17 ip-172-31-7-167 mount[1196]: mount: /dev/xvdv is already mounted or /media/v busy Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Mount process exited, code=exited status=32 Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: Failed to mount /media/v. Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Unit entered failed state.

From the cloud-init logs, it appears that the the OVF datasource is mounting the device to find data:

2017-09-19 20:12:17,502 - util.py[DEBUG]: Peeking at /dev/xvdv (max_bytes=512) 2017-09-19 20:12:17,502 - util.py[DEBUG]: Reading from /proc/mounts (quiet=False) 2017-09-19 20:12:17,502 - util.py[DEBUG]: Read 2570 bytes from /proc/mounts ... 2017-09-19 20:12:17,506 - util.py[DEBUG]: Running command ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid'] with allowed return codes [0] (shell=False, capture=True) 2017-09-19 20:12:17,545 - util.py[DEBUG]: Failed mount of '/dev/xvdv' as 'iso9660': Unexpected error while running command. Command: ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid'] Exit code: 32 Reason: - Stdout: - Stderr: mount: wrong fs type, bad option, bad superblock on /dev/xvdv, missing codepage or helper program, or other error

           In some cases useful info is found in syslog - try
           dmesg | tail or so.

2017-09-19 20:12:17,545 - util.py[DEBUG]: Recursively deleting /tmp/tmpw2tyqqid 2017-09-19 20:12:17,545 - DataSourceOVF.py[DEBUG]: /dev/xvdv not mountable as iso9660

[Vitals]

Version: 0.7.9-153-g16a7302f-0ubuntu1~16.04.2 OS: Ubuntu 16.04 Provider: AWS - ami-1d4e7a66

[Recreate]

To recreate this

Launch an AWS instance using AMI ami-1d4e7a66 and attach several disks (I used 25 additional disks)
Format and mount all 25: mkdir /media/{b..z} for i in {b..z}; do mkfs -t ext4 /dev/xvd$i mount /dev/xvd$i /media/$i echo "/dev/xvd$i /media/$i ext4 defaults,nofail 0 2" >> /etc/fstab done
reboot instance

Since this is a race, multiple may be necessary. A reproducer script is attached.

ubuntu-server-builder commented 1 year ago

Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:30:56.010514+00:00

Launchpad attachments: cloud-init.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Billy Olsen(billy-olsen) wrote on 2017-09-19T20:34:21.370849+00:00

Launchpad attachments: Recreate script for monitoring mountpoints and rebooting

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2017-09-20T18:19:50.207080+00:00

The current analysis is that any parallel invocation of mount which specifies the same device is going to result in one of the two mounts to fail with EBUSY (or some other failure) do to how the underlying device handles either opens or mounts in the kernel. This is recreatable outside of cloud-init when running mounts in parallel pointing to the same disk.

rw mount loop

while true; do umount -f ${MNT_POINT} |: mount ${DISK} ${MNT_POINT} -t ext4 -o defaults || { echo "date +%s.%N: mount failed on COUNT=$COUNT"; exit 1 } echo "date +%s.%N: COUNT=$COUNT" > ${MNT_POINT}/lasttouch COUNT=$(($COUNT + 1)) done

ro mount loop

while true; do dd if=${DISK} bs=512 count=1 of=/dev/null

we expect this to fail, that's OK

mount -o ro,sync -t iso9660 ${DISK} ${MNT_POINT} |:
echo "`date +%s.%N`: READ COUNT=$COUNT"
COUNT=$(($COUNT + 1))

done

There isn't currently a general solution to prevent this race from occuring; so the focus is on reducing the likelihood of a race, and configuration changes to be resilient in the face of the race.

The linked changes to cloud-init will reduce the number of block devices that OVF will probe. The OVF Datasource is looking for block devices with ISO9660 filesystems; on an instance that has no ISO9660 filesystems, OVF will not perform any mount operations.

When crafting fstab entries, one can set the 6th column, (fs_passno) to zero (0) and systemd will not attempt to fsck the device prior to mounting. This does not remove the mount race, but it does reduce the amount of time between when the mount unit starts and when it completes.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2017-09-21T03:58:26.963034+00:00

Some more details, in particular the ami-id is needed:

https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-1d4e7a66

t2-micro is fine, in us-east1, add 26 volumes, mark for deletion.

Run the script and star the reboot check.

That triggered in a single reboot; Instance is old at this time, comes with linux-image-4.4.0-aws-1030.

I'm currently testing with the image updated; current linux image is 4.4.0-aws-1035; That's not reproducing as quickly, so far 25 reboots with no issues.

Separately on an a non-upgraded instance with the above ami and instance type as described, I've packaged and installed a cloud-init package from the linked branch; that's successfully run for 68 reboots with no issues.

i'll continue running these over night and update this in the morning.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2017-09-21T14:15:40.493129+00:00

Instance on 4.4.0-aws-1035 has 525 boots with no issues/no changes to cloud-init Instance on 4.4.0-aws-1030 has 650 boots with no issues/updated cloud-init

I believe it's still required to have the cloud-init changes; however, due to changes in the kernel/systemd or otherwise; systemd mounts are not interleaved with cloud-init probing. This behavior is very consistent within a specific kernel/systemd pair.

For example on the latest boot of 4.4.0-aws-1035, I can check the timestamps (first and last) of the mounts (ignoring xvda/rootfs)

% $ journalctl --utc -o short-precise | grep mounted.filesystem | grep -v xvda | sed -e 1b -e '$!d'

Sep 21 13:56:13.858878 ip-172-31-27-63 kernel: EXT4-fs (xvdr): mounted filesystem with ordered data mode. Opts: (null) Sep 21 13:56:14.129499 ip-172-31-27-63 kernel: EXT4-fs (xvdo): mounted filesystem with ordered data mode. Opts: (null)

And check the most recent cloud-init boot for when OVF search starts/stops

cloud-init-local starts OVF searching at: 2017-09-21 13:56:09,617 - handlers.py[DEBUG]: start: init-local/search-OVF: searching for local data from DataSourceOVF

Finishes at: 2017-09-21 13:56:10,322 - handlers.py[DEBUG]: finish: init-local/search-OVF: SUCCESS: no local data found from DataSourceOVF

So we probed before systemd mounts

Then at cloud-init.service (net mode) time, start at 2017-09-21 13:56:11,650 - handlers.py[DEBUG]: start: init-network/search-OVFNet: searching for network data from DataSourceOVFNet

Finish OVFNet at: 2017-09-21 13:56:12,246 - handlers.py[DEBUG]: finish: init-network/search-OVFNet: SUCCESS: no network data found from DataSourceOVFNet

And OVFNet probes before systemd mounts start at 13:56:13.85

ubuntu-server-builder commented 1 year ago

Launchpad user Robin Miller(robincello) wrote on 2017-10-12T19:21:27.956807+00:00

This issue seems to have returned on kernel 4.4.0-1038-aws with cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.2

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2017-10-17T19:14:05.278745+00:00

Currently validating cloud-init in xenial-proposed which resolves this specific issue with cloud-init racing systemd-fstab entries with the probing during DatasourceOVF.

cloud-init | 17.1-18-gd4f70470-0ubuntu1~16.04.1 | xenial-proposed | source, all

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-12-14T21:02:11.259951+00:00

This bug is believed to be fixed in cloud-init in 1705804. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

canonical / cloud-init

systemd mount targets fail due to device busy or already mounted #3005

rw mount loop

ro mount loop

we expect this to fail, that's OK