canonical / multipass

Multipass orchestrates virtual Ubuntu instances
https://multipass.run
GNU General Public License v3.0

Multipass instance on M1 Silicon becomes read-only after a restart #3455

Closed: uzhielbautista closed this issue 3 months ago

uzhielbautista commented 3 months ago

Describe the bug: Multipass instance on M1 Silicon becomes read-only after a restart.

To Reproduce

  1. `multipass shell docker`
  2. `touch test.txt`
  3. Output is `touch: cannot touch 'test.txt': Read-only file system`

Expected behavior: the test.txt file is created.

Logs

[2024-04-02T09:18:23.987] [warning] [url downloader] Error getting https://multipass.run/static/latest-release.json: Host multipass.run not found - trying cache.
[2024-04-02T09:18:23.988] [warning] [url downloader] Error getting https://codeload.github.com/canonical/multipass-blueprints/zip/refs/heads/main: Host codeload.github.com not found - trying cache.
[2024-04-02T09:18:23.995] [info] [update] Failed to fetch update info: failed to download from 'https://multipass.run/static/latest-release.json': Error opening https://multipass.run/static/latest-release.json
[2024-04-02T09:18:24.013] [info] [rpc] gRPC listening on unix:/var/run/multipass_socket
[2024-04-02T09:18:24.013] [debug] [async task] fetch manifest periodically
[2024-04-02T09:18:24.015] [warning] [url downloader] Error getting https://cloud-images.ubuntu.com/releases/streams/v1/index.json: Host cloud-images.ubuntu.com not found - trying cache.
[2024-04-02T09:18:24.015] [warning] [url downloader] Error getting https://cloud-images.ubuntu.com/buildd/daily/streams/v1/index.json: Host cloud-images.ubuntu.com not found - trying cache.
[2024-04-02T09:18:24.016] [warning] [url downloader] Error getting https://cloud-images.ubuntu.com/buildd/daily/streams/v1/com.ubuntu.cloud:daily:download.json: Host cloud-images.ubuntu.com not found - trying cache.
[2024-04-02T09:18:24.016] [warning] [url downloader] Error getting https://cloud-images.ubuntu.com/releases/streams/v1/com.ubuntu.cloud:released:download.json: Host cloud-images.ubuntu.com not found - trying cache.
[2024-04-02T09:18:24.019] [debug] [qemu-img] [403] started: qemu-img snapshot -l /var/root/Library/Application Support/multipassd/qemu/vault/instances/primary/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.054] [info] [VMImageHost] Did not find any supported products in "appliance"
[2024-04-02T09:18:24.379] [debug] [qemu-img] [406] started: qemu-img amend -o compat=1.1 /var/root/Library/Application Support/multipassd/qemu/vault/instances/primary/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.395] [debug] [qemu-img] [407] started: qemu-img snapshot -l /var/root/Library/Application Support/multipassd/qemu/vault/instances/sterling-bulldog/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.416] [debug] [qemu-img] [408] started: qemu-img amend -o compat=1.1 /var/root/Library/Application Support/multipassd/qemu/vault/instances/sterling-bulldog/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.430] [debug] [qemu-img] [410] started: qemu-img snapshot -l /var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.445] [debug] [qemu-img] [412] started: qemu-img amend -o compat=1.1 /var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/ubuntu-22.04-server-cloudimg-arm64.img
[2024-04-02T09:18:24.464] [info] [sshfs-mount-handler] initializing mount /Users/macuser/docker/multipass-shared => ~/multipass-shared in 'docker'
[2024-04-02T09:18:24.464] [info] [sshfs-mount-handler] initializing mount /Users/macuser/multipass/docker => docker in 'docker'
[2024-04-02T09:18:24.465] [info] [image vault] Source image 22.04 is expired. Removing it from the cache.
[2024-04-02T09:18:24.466] [info] [daemon] Starting Multipass 1.13.1+mac
[2024-04-02T09:18:24.466] [info] [daemon] Daemon arguments: /Library/Application Support/com.canonical.multipass/bin/multipassd --verbosity debug
[2024-04-02T09:33:23.516] [debug] [async task] fetch manifest periodically
[2024-04-02T09:33:23.524] [info] [VMImageHost] Did not find any supported products in "appliance"
[... the "fetch manifest periodically" / "Did not find any supported products" pairs above repeat every 15 minutes ...]
[2024-04-02T15:18:24.520] [debug] [image vault] Checking for images to update…
[2024-04-02T15:33:23.517] [debug] [async task] fetch manifest periodically
[2024-04-02T15:33:23.526] [info] [VMImageHost] Did not find any supported products in "appliance"
[2024-04-02T15:48:23.516] [debug] [async task] fetch manifest periodically
[2024-04-02T15:48:23.523] [info] [VMImageHost] Did not find any supported products in "appliance"
[2024-04-02T15:52:59.358] [debug] [docker] process working dir ''
[2024-04-02T15:52:59.358] [info] [docker] process program 'qemu-system-aarch64'
[2024-04-02T15:52:59.358] [info] [docker] process arguments '-machine, virt,gic-version=3, -accel, hvf, -drive, file=/Library/Application Support/com.canonical.multipass/bin/../Resources/qemu/edk2-aarch64-code.fd,if=pflash,format=raw,readonly=on, -cpu, host, -nic, vmnet-shared,model=virtio-net-pci,mac=52:54:00:9e:1c:02, -device, virtio-scsi-pci,id=scsi0, -drive, file=/var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/ubuntu-22.04-server-cloudimg-arm64.img,if=none,format=qcow2,discard=unmap,id=hda, -device, scsi-hd,drive=hda,bus=scsi0.0, -smp, 8, -m, 8192M, -qmp, stdio, -chardev, null,id=char0, -serial, chardev:char0, -nographic, -cdrom, /var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/cloud-init-config.iso'
[2024-04-02T15:52:59.392] [debug] [qemu-system-aarch64] [8590] started: qemu-system-aarch64 -machine virt,gic-version=3 -nographic -dump-vmstate /private/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/multipassd.JKVHHw
[2024-04-02T15:53:00.167] [info] [docker] process state changed to Starting
[2024-04-02T15:53:00.174] [info] [docker] process state changed to Running
[2024-04-02T15:53:00.174] [debug] [qemu-system-aarch64] [8591] started: qemu-system-aarch64 -machine virt,gic-version=3 -accel hvf -drive file=/Library/Application Support/com.canonical.multipass/bin/../Resources/qemu/edk2-aarch64-code.fd,if=pflash,format=raw,readonly=on -cpu host -nic vmnet-shared,model=virtio-net-pci,mac=52:54:00:9e:1c:02 -device virtio-scsi-pci,id=scsi0 -drive file=/var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/ubuntu-22.04-server-cloudimg-arm64.img,if=none,format=qcow2,discard=unmap,id=hda -device scsi-hd,drive=hda,bus=scsi0.0 -smp 8 -m 8192M -qmp stdio -chardev null,id=char0 -serial chardev:char0 -nographic -cdrom /var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/cloud-init-config.iso
[2024-04-02T15:53:00.174] [info] [docker] process started
[2024-04-02T15:53:00.175] [debug] [docker] Waiting for SSH to be up
[2024-04-02T15:53:00.573] [debug] [docker] QMP: {"QMP": {"version": {"qemu": {"micro": 1, "minor": 2, "major": 8}, "package": ""}, "capabilities": ["oob"]}}^M

[2024-04-02T15:53:00.626] [debug] [docker] QMP: {"return": {}}^M

[2024-04-02T15:53:13.928] [debug] [docker] QMP: {"timestamp": {"seconds": 1712035393, "microseconds": 928290}, "event": "NIC_RX_FILTER_CHANGED", "data": {"path": "/machine/unattached/device[13]/virtio-backend"}}^M

[2024-04-02T15:53:26.647] [debug] [ssh session] Executing 'which snap'
[2024-04-02T15:53:26.993] [debug] [ssh session] Executing 'sudo snap list multipass-sshfs'
[2024-04-02T15:53:35.128] [debug] [ssh session] Executing 'ip -brief -family inet address show scope global'
[2024-04-02T15:53:41.999] [warning] [daemon] Removing mount "docker" from 'docker': failed to obtain exit status for remote process 'sudo snap list multipass-sshfs': timeout

[2024-04-02T15:53:42.047] [debug] [ssh session] Executing 'which snap'
[2024-04-02T15:53:42.201] [debug] [ssh session] Executing 'sudo snap list multipass-sshfs'
[2024-04-02T15:53:57.203] [warning] [daemon] Removing mount "~/multipass-shared" from 'docker': failed to obtain exit status for remote process 'sudo snap list multipass-sshfs': timeout

Additional info

Name:           docker
State:          Running
Snapshots:      1
IPv4:           192.168.64.8
Release:        Ubuntu 22.04.4 LTS
Image hash:     0f5f68b9b746 (Ubuntu 22.04 LTS)
CPU(s):         8
Load:           0.00 0.00 0.00
Disk usage:     12.0GiB out of 61.9GiB
Memory usage:   153.8MiB out of 7.7GiB
Mounts:         --

Name:           sterling-bulldog
State:          Suspended
Snapshots:      0
IPv4:           --
Release:        --
Image hash:     dddfb1741f16 (Ubuntu 22.04 LTS)
CPU(s):         --
Load:           --
Disk usage:     --
Memory usage:   --
Mounts:         --


- `qemu`

**Additional context**
The instances appear to work fine at first, e.g. they are writable right after creation, but something causes them to become read-only after a restart.
ricab commented 3 months ago

Hi @uzhielbautista, I just tried the docker blueprint on a Mac and I could not reproduce this. You don't happen to be running out of space on your host, do you? Also, did you happen to execute the touch command in step 2 on a mounted directory?

Your log shows your mounts being removed because commands were not responsive. Is your instance otherwise functional? Can you `echo blah > /dev/null`? What is the output of `sudo lsblk --output NAME,FSTYPE,LABEL,MOUNTPOINT,MODE,RO,STATE`?
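As a generic aside (not Multipass-specific), a quick way to check from inside an instance whether the root filesystem has been remounted read-only is to read its mount options from `/proc/mounts`:

```shell
# Print the mount options of "/" and report its state.
# /proc/mounts lists each mount as: device mountpoint fstype options ...
opts=$(awk '$2 == "/" { print $4; exit }' /proc/mounts)
case "$opts" in
  ro,*|ro) echo "root is read-only" ;;
  *)       echo "root is writable"  ;;
esac
```

`findmnt -no OPTIONS /` gives the same information where util-linux is available.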

uzhielbautista commented 3 months ago

Hi @ricab, thanks for the prompt reply.

The host has ample free space (about 100 GB) and the docker Multipass instance occupies relatively little space, as it was only set up to test out Portainer.

I executed `touch test.txt` in the home directory of the instance (`/home/ubuntu`), but on a related note I found that I can no longer mount a directory, whereas initially I could with this instance. Output below:

machost$ multipass mount `realpath tempdir` docker:/home/ubuntu/tempdir
mount failed: The following errors occurred:
error mounting "/home/ubuntu/tempdir": failed to obtain exit status for remote process 'sudo snap list multipass-sshfs': timeout

The instance is generally functional, I can echo to /dev/null, run basic commands like top and navigate the filesystem.

The error is produced only when something attempts to write to the instance's disk. I'm not sure if this has to do with macOS's security/permission settings, but I don't recall this kind of problem with Multipass until the recent versions of macOS Sonoma.

ubuntu@docker:~$ echo "hello" > world.txt
-bash: world.txt: Read-only file system
...
ubuntu@docker:~$ apt-get update
Reading package lists... Done
W: Not using locking for read only lock file /var/lib/apt/lists/lock
W: Problem unlinking the file /var/lib/apt/lists/partial/ports.ubuntu.com_ubuntu-ports_dists_jammy_InRelease - PrepareFiles (13: Permission denied)
...
W: chown to _apt:root of directory /var/lib/apt/lists/partial failed - SetupAPTPartialDirectory (30: Read-only file system)
W: chmod 0700 of directory /var/lib/apt/lists/partial failed - SetupAPTPartialDirectory (30: Read-only file system)
W: chown to _apt:root of directory /var/lib/apt/lists/auxfiles failed - SetupAPTPartialDirectory (30: Read-only file system)
W: chmod 0755 of directory /var/lib/apt/lists/auxfiles failed - SetupAPTPartialDirectory (30: Read-only file system)
W: Not using locking for read only lock file /var/lib/apt/lists/lock
W: Problem unlinking the file /var/lib/apt/lists/partial/.apt-acquire-privs-test.eYycC4 - IsAccessibleBySandboxUser (30: Read-only file system)
uzhielbautista commented 3 months ago

Hi @ricab, I've done some further testing and I think I know what might be happening.

Running dmesg inside the instance showed an error about a corrupted system.journal file, followed by ext4 errors that caused the filesystem to be remounted read-only, as below.

[Tue Apr  2 15:53:10 2024] EXT4-fs (sda1): re-mounted. Opts: discard,errors=remount-ro. Quota mode: none.
[Tue Apr  2 15:53:10 2024] alua: device handler registered
[Tue Apr  2 15:53:10 2024] systemd-journald[441]: Received client request to flush runtime journal.
[Tue Apr  2 15:53:10 2024] emc: device handler registered
[Tue Apr  2 15:53:10 2024] rdac: device handler registered
[Tue Apr  2 15:53:10 2024] systemd-journald[441]: /var/log/journal/addefa2ecf9d4c07bdd2735d2813d2b8/system.journal: Journal file corrupted, rotating.
...
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1): ext4_validate_block_bitmap:420: comm systemd-tmpfile: bg 46: bad block bitmap checksum
[Tue Apr  2 15:53:11 2024] Aborting journal on device sda1-8.
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm apparmor_parser: Detected aborted journal
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm apparmor_parser: Detected aborted journal
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[Tue Apr  2 15:53:11 2024] EXT4-fs (sda1): Remounting filesystem read-only
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1) in ext4_orphan_add:188: Journal has aborted
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5789: Journal has aborted
[Tue Apr  2 15:53:11 2024] EXT4-fs error (device sda1): ext4_mkdir:3038: inode #518674: comm systemd-tmpfile: mark_inode_dirty error
...
[Tue Apr  2 15:53:13 2024] random: crng init done
[Tue Apr  2 15:53:13 2024] random: 254 urandom warning(s) missed due to ratelimiting
[Tue Apr  2 15:53:14 2024] EXT4-fs error (device sda1): ext4_validate_block_bitmap:420: comm ext4lazyinit: bg 126: bad block bitmap checksum

Given that this happens to every instance I install on this machine within a relatively short time, either something is terribly wrong with the SSD hardware, Sonoma is somehow corrupting the files, or I am just having terrible luck.

Thanks again for looking into this. I'll close this ticket now and will try installing Multipass again once I've fully reset the system.
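As an editorial aside: when this kind of corruption is suspected, the qcow2 container itself can be sanity-checked from the host with `qemu-img check` while the instance is stopped. This validates only the qcow2 structure, not the ext4 filesystem inside it; the image path below is taken from the daemon log earlier in this thread, and the guards make this sketch a no-op where the tool or image is absent.

```shell
# Sketch: check qcow2-level integrity of the 'docker' instance image (instance stopped).
IMG="/var/root/Library/Application Support/multipassd/qemu/vault/instances/docker/ubuntu-22.04-server-cloudimg-arm64.img"
command -v qemu-img >/dev/null || { echo "qemu-img not found; skipping"; exit 0; }
[ -e "$IMG" ] || { echo "image not found; skipping"; exit 0; }
sudo qemu-img check "$IMG"
```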

ricab commented 3 months ago

How frustrating :slightly_frowning_face: One thing that could be happening is macOS killing your instances when you shut down or restart. We try to suspend any running instances, but that can take a little while, and if macOS decides it doesn't want to wait, pulling the plug on an instance can indeed corrupt it. Perhaps try to suspend or shut them down beforehand.
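A minimal sketch of that habit, assuming a recent Multipass release where `suspend` and `stop` accept `--all` (guarded so it is a no-op where multipass is not installed):

```shell
# Park all instances cleanly before rebooting the host, so macOS cannot
# kill qemu mid-write and corrupt the disk images.
command -v multipass >/dev/null || { echo "multipass not installed; skipping"; exit 0; }
multipass suspend --all     # or: multipass stop --all
```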

Also, Multipass now has snapshots. You can take a snapshot of your instance while it's behaving well, to restore from when this happens again.
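For readers following along, the workflow is roughly the following (a sketch assuming Multipass 1.13+, where snapshots are taken of stopped instances; the snapshot name is illustrative):

```shell
# Record a known-good state of the 'docker' instance.
command -v multipass >/dev/null || { echo "multipass not installed; skipping"; exit 0; }
multipass stop docker                         # snapshots are taken while stopped
multipass snapshot docker --name known-good
multipass start docker
# ...later, if the instance turns read-only again:
# multipass stop docker
# multipass restore docker.known-good         # roll back to the snapshot
```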

Hope this helps a little. Let us know if you find any more clues as to what is causing this corruption.

uzhielbautista commented 3 months ago

I think you might be right about the instances being killed off causing the corruption.

Also, I should have added that I did use the snapshot feature, which works perfectly: I was able to restore a working instance yesterday. I could take snapshots daily using a scheduled task, which would be useful in my case.
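A minimal sketch of such a scheduled task (the script path, instance name, and naming scheme are all illustrative; assumes snapshots require a stopped instance):

```shell
#!/bin/sh
# Hypothetical daily-snapshot script; schedule it e.g. from cron:
#   0 3 * * * /usr/local/bin/multipass-daily-snapshot.sh
command -v multipass >/dev/null || { echo "multipass not installed; skipping"; exit 0; }
multipass stop docker
multipass snapshot docker --name "daily-$(date +%Y%m%d)"
multipass start docker
```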

At least this should get me through until I've done a disk repair and a clean install of macOS.

Thank you again for following up.

ricab commented 3 months ago

Alright, thanks for the feedback @uzhielbautista! I am glad you are finding snapshots useful. Let me just call your attention to a couple of caveats. Quoting from here:

  • Long chains of snapshots have a detrimental effect on performance. Since they rely on layers of disk diffs, the more snapshots there are in a sequence, the more hops are necessary to read data that is recorded by the most ancient layers.
  • While snapshots are useful to save and recover instance states, their utility as safe backups is limited. Since they are stored in the same medium as the original images, they are as likely to be affected by disk failures.
uzhielbautista commented 3 months ago

Thank you for pointing this out @ricab!