coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Persistent /var on RAID: What am I doing wrong? #1607

Closed runiq closed 11 months ago

runiq commented 12 months ago

Bug

I want to use mirrored boot disks, but I also want to put /var on a separate, persistent partition. Because persistent RAID partitions don't seem to be supported out of the box, I'm following the advice in this comment to create the RAID+filesystem for /var in my own systemd unit. While this works in principle, it seems Ignition requires that my /var filesystem be available while Ignition itself is running. Is this somehow possible to achieve?

Operating System Version

Fedora CoreOS 38.20231002.3.1

Ignition Version

3.4.0

Environment

Libvirt on Fedora Kinoite 38.20231103.0

Expected Behavior

Actual Behavior

It looks like my /var filesystem must be available during the Ignition run for the system to be functional. After Ignition runs its course (without apparent error), my /var only seems to be 'half-populated,' so to speak. These are the issues I see:

Reproduction Steps

Use the following Butane file:

repro.bu:

```yaml
variant: "fcos"
version: "1.5.0"
boot_device:
  mirror:
    devices:
      - "/dev/disk/by-id/virtio-root-1"
      - "/dev/disk/by-id/virtio-root-2"
systemd:
  units:
    - name: "serial-getty@ttyS0.service"
      dropins:
        - name: "autologin-core.conf"
          contents: |
            [Service]
            # Override Execstart in main unit
            ExecStart=
            # Add new Execstart with `-` prefix to ignore failure
            ExecStart=-/usr/sbin/agetty --autologin core --noclear %I $TERM
            TTYVTDisallocate=no
    - name: "create-var.service"
      enabled: true
      contents: |
        [Unit]
        Description=Create md-var RAID and var filesystem
        DefaultDependencies=no

        # We 'slot' this in between the component devices of the RAID volume
        # and the /var mount:
        After=local-fs-pre.target
        After=dev-disk-by\x2dpartlabel-var\x2d1.device
        After=dev-disk-by\x2dpartlabel-var\x2d2.device
        Before=systemd-fsck@dev-md-md\x2dvar.service
        Before=var.mount

        # The RAID itself and the filesystem on it should NOT yet exist for this
        # unit to run
        ConditionPathExists=!/dev/md/md-var
        ConditionPathExists=!/dev/disk/by-label/var

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/bin/bash -c 'echo yes | /usr/sbin/mdadm --create md-var \
          --verbose \
          --homehost=any \
          --level=raid1 \
          --raid-devices=2 \
          /dev/disk/by-partlabel/var-1 \
          /dev/disk/by-partlabel/var-2'
        ExecStart=/usr/bin/bash -c 'ls -l /var 1>&2'
        # mkfs.xfs fails if there is already a filesystem present on the device
        ExecStart=/usr/sbin/mkfs.xfs -Lvar /dev/md/md-var

        [Install]
        WantedBy=dev-md-md\x2dvar.device
    - name: "var.mount"
      enabled: false
      contents: |
        [Unit]
        Requires=systemd-fsck@dev-md-md\x2dvar.service
        After=systemd-fsck@dev-md-md\x2dvar.service

        [Mount]
        What=/dev/md/md-var
        Where=/var
        Type=xfs
        Options=defaults,strictatime,lazytime,prjquota

        [Install]
        RequiredBy=local-fs.target
storage:
  disks:
    - device: "/dev/disk/by-id/virtio-root-1"
      partitions:
        - label: "esp-1"
          wipe_partition_entry: true
        - label: "boot-1"
          wipe_partition_entry: true
        - label: "root-1"
          size_mib: 8400
          wipe_partition_entry: true
        - label: "var-1"
          wipe_partition_entry: false
          type_guid: "A19D880F-05FC-4D3B-A006-743F0F84911E"
    - device: "/dev/disk/by-id/virtio-root-2"
      partitions:
        - label: "esp-2"
          wipe_partition_entry: true
        - label: "boot-2"
          wipe_partition_entry: true
        - label: "root-2"
          size_mib: 8400
          wipe_partition_entry: true
        - label: "var-2"
          wipe_partition_entry: false
          type_guid: "A19D880F-05FC-4D3B-A006-743F0F84911E"
  filesystems:
    - device: "/dev/md/md-boot"
      wipe_filesystem: true
    - device: "/dev/md/md-root"
      wipe_filesystem: true
      format: "xfs"
```

Compile it to Ignition:
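For reference (not part of the original report), something like this transpiles it, assuming the containerized Butane from the FCOS docs:

```bash
# Translate the Butane config into an Ignition config
podman run --rm -i quay.io/coreos/butane:release \
    --pretty --strict < repro.bu > repro.ign
```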

repro.ign:

```json
{
  "ignition": {
    "version": "3.4.0"
  },
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/by-id/virtio-root-1",
        "partitions": [
          {
            "label": "bios-1",
            "sizeMiB": 1,
            "typeGuid": "21686148-6449-6E6F-744E-656564454649"
          },
          {
            "label": "esp-1",
            "sizeMiB": 127,
            "typeGuid": "C12A7328-F81F-11D2-BA4B-00A0C93EC93B",
            "wipePartitionEntry": true
          },
          {
            "label": "boot-1",
            "sizeMiB": 384,
            "wipePartitionEntry": true
          },
          {
            "label": "root-1",
            "sizeMiB": 8400,
            "wipePartitionEntry": true
          },
          {
            "label": "var-1",
            "typeGuid": "A19D880F-05FC-4D3B-A006-743F0F84911E",
            "wipePartitionEntry": false
          }
        ],
        "wipeTable": true
      },
      {
        "device": "/dev/disk/by-id/virtio-root-2",
        "partitions": [
          {
            "label": "bios-2",
            "sizeMiB": 1,
            "typeGuid": "21686148-6449-6E6F-744E-656564454649"
          },
          {
            "label": "esp-2",
            "sizeMiB": 127,
            "typeGuid": "C12A7328-F81F-11D2-BA4B-00A0C93EC93B",
            "wipePartitionEntry": true
          },
          {
            "label": "boot-2",
            "sizeMiB": 384,
            "wipePartitionEntry": true
          },
          {
            "label": "root-2",
            "sizeMiB": 8400,
            "wipePartitionEntry": true
          },
          {
            "label": "var-2",
            "typeGuid": "A19D880F-05FC-4D3B-A006-743F0F84911E",
            "wipePartitionEntry": false
          }
        ],
        "wipeTable": true
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/esp-1",
        "format": "vfat",
        "label": "esp-1",
        "wipeFilesystem": true
      },
      {
        "device": "/dev/disk/by-partlabel/esp-2",
        "format": "vfat",
        "label": "esp-2",
        "wipeFilesystem": true
      },
      {
        "device": "/dev/md/md-boot",
        "format": "ext4",
        "label": "boot",
        "wipeFilesystem": true
      },
      {
        "device": "/dev/md/md-root",
        "format": "xfs",
        "label": "root",
        "wipeFilesystem": true
      }
    ],
    "raid": [
      {
        "devices": [
          "/dev/disk/by-partlabel/boot-1",
          "/dev/disk/by-partlabel/boot-2"
        ],
        "level": "raid1",
        "name": "md-boot",
        "options": [
          "--metadata=1.0"
        ]
      },
      {
        "devices": [
          "/dev/disk/by-partlabel/root-1",
          "/dev/disk/by-partlabel/root-2"
        ],
        "level": "raid1",
        "name": "md-root"
      }
    ]
  },
  "systemd": {
    "units": [
      {
        "dropins": [
          {
            "contents": "[Service]\n# Override Execstart in main unit\nExecStart=\n# Add new Execstart with `-` prefix to ignore failure\nExecStart=-/usr/sbin/agetty --autologin core --noclear %I $TERM\nTTYVTDisallocate=no\n",
            "name": "autologin-core.conf"
          }
        ],
        "name": "serial-getty@ttyS0.service"
      },
      {
        "contents": "[Unit]\nDescription=Create md-var RAID and var filesystem\nDefaultDependencies=no\n\n# We 'slot' this in between the component devices of the RAID volume\n# and the /var mount:\nAfter=local-fs-pre.target\nAfter=dev-disk-by\\x2dpartlabel-var\\x2d1.device\nAfter=dev-disk-by\\x2dpartlabel-var\\x2d2.device\nBefore=systemd-fsck@dev-md-md\\x2dvar.service\nBefore=var.mount\n\n# The RAID itself and the filesystem on it should NOT yet exist for this\n# unit to run\nConditionPathExists=!/dev/md/md-var\nConditionPathExists=!/dev/disk/by-label/var\n\n[Service]\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/bin/bash -c 'echo yes | /usr/sbin/mdadm --create md-var \\\n\t--verbose \\\n\t--homehost=any \\\n\t--level=raid1 \\\n\t--raid-devices=2 \\\n\t/dev/disk/by-partlabel/var-1 \\\n\t/dev/disk/by-partlabel/var-2'\nExecStart=/usr/bin/bash -c 'ls -l /var 1\u003e\u00262'\n# mkfs.xfs fails if there is already a filesystem present on the device\nExecStart=/usr/sbin/mkfs.xfs -Lvar /dev/md/md-var\n\n[Install]\nWantedBy=dev-md-md\\x2dvar.device\n",
        "enabled": true,
        "name": "create-var.service"
      },
      {
        "contents": "[Unit]\nRequires=systemd-fsck@dev-md-md\\x2dvar.service\nAfter=systemd-fsck@dev-md-md\\x2dvar.service\n\n[Mount]\nWhat=/dev/md/md-var\nWhere=/var\nType=xfs\nOptions=defaults,strictatime,lazytime,prjquota\n\n[Install]\nRequiredBy=local-fs.target\n",
        "enabled": false,
        "name": "var.mount"
      }
    ]
  }
}
```

Create the VM using libvirt (adapted from the documentation):

#!/usr/bin/env bash
set -euxo pipefail

# Download image
readonly image="$(podman run --rm --init \
    --security-opt label=disable \
    -v ./:/data \
    -w /data \
    quay.io/coreos/coreos-installer:release \
        download --stream stable -p qemu -f qcow2.xz --decompress)"
mv "$image" fcos.qcow2
rm -f "${image}.sig"

IGNITION_CONFIG="$PWD/repro.ign"
IMAGE="$PWD/fcos.qcow2"
chcon --verbose --type svirt_home_t ${IGNITION_CONFIG} ${IMAGE}

# Remove old VM & storage from previous runs if they exist
virsh \
    --connect=qemu:///session \
    destroy --remove-logs \
    repro || true
virsh \
    --connect=qemu:///session \
    undefine \
    --nvram \
    --managed-save \
    --checkpoints-metadata \
    --snapshots-metadata \
    --remove-all-storage \
    repro || true

# Provision new VM
virt-install \
    --connect=qemu:///session \
    --name=repro \
    --vcpus=2 \
    --memory=8192 \
    --boot=uefi \
    --graphics=none \
    --disk="path=root-1.qcow2,size=12,serial=root-1,backing_store=${IMAGE}" \
    --disk="path=root-2.qcow2,size=12,serial=root-2" \
    --os-variant="fedora-coreos-stable" \
    --qemu-commandline="-fw_cfg name=opt/com.coreos/config,file=${IGNITION_CONFIG}" \
    --import
travier commented 12 months ago

Moving to the tracker as this is about Fedora CoreOS.

Nemric commented 11 months ago

Hi, I only skimmed your Butane config, but at first sight I noticed that you don't use the `with_mount_unit: true` option in the `filesystems` section. Here is my config:

  filesystems:
    - path: /var
      device: /dev/md/Raid
      format: xfs
      label: Var
      wipe_filesystem: false
      with_mount_unit: true

I use `wipe_filesystem: false` because it runs on a live boot / PXE system.
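If it helps to compare approaches, the mount unit that `with_mount_unit: true` generates can be inspected on a provisioned node (a quick check; the unit name derives from the mount point, so `path: /var` yields `var.mount`):

```bash
# Show the generated mount unit and which file provides it
systemctl cat var.mount
```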

runiq commented 11 months ago

> Hi, I only skimmed your Butane config, but at first sight I noticed that you don't use the `with_mount_unit: true` option in the `filesystems` section.

I'm using a dedicated mount unit (in the `systemd: units:` section) instead. If I use `filesystems:` and `raid:` entries as you suggest, the data on /var is lost when I replace a disk in the RAID and then reprovision the system. For me, this is an important consideration, because I'd like to run FCOS on a NAS. Since this will be a long-running system, reprovisioning will probably happen, so I'm trying to account for it.

dustymabe commented 11 months ago

I imagine part of the problem here is that systemd-tmpfiles runs before your /var is mounted, so necessary files for some services don't get created. I wonder if, when you reboot the system so that systemd-tmpfiles runs again, some of the problems go away.
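Not from the thread, but one quick way to test that theory on the affected node is to re-apply the tmpfiles.d entries under /var by hand and see whether the missing files show up:

```bash
# Re-run only the tmpfiles.d rules whose paths start with /var; these are
# the ones that would have been skipped while /var was not yet mounted
sudo systemd-tmpfiles --create --prefix=/var
```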

Do you have a boot log for the system that you could share?
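(For reference, the journal of the current boot can be dumped to a file like this; use `-b -1` for the previous boot:)

```bash
# Save the full log of the current boot for sharing
journalctl -b --no-pager > boot.log
```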

Nemric commented 11 months ago

I really think that you should use `with_mount_unit: true`, and if you want to reprovision your system you should use `wipe_filesystem: false`. As far as I can remember, when I set up my PXE-booted server for the first time, I wiped/formatted/created everything to have a clean RAID, and after that I changed my Butane/Ignition config to deal with a persistent RAID /var.

Here is my current RAID config:

storage:
  raid:
    - name: Raid
      level: mirror
      devices:
        - /dev/disk/by-id/ata-WDC_WD10SPZX-80Z10T2_WD-WX41A49H9FT4
        - /dev/disk/by-id/ata-WDC_WD10SPZX-80Z10T2_WD-WXL1A49KPYFD
      options:
        - --metadata=1.2
        - --assume-clean
        - --uuid=7ec8d4df:823fae52:c55d5e56:e773b281

`--metadata=1.2` and `--uuid=7ec8d4df:823fae52:c55d5e56:e773b281` are values I took from an mdadm.conf or somewhere like that ^^ (as far as I can remember...).

`--assume-clean` lets my server boot without a full RAID check/build/sync.
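As a sanity check (my addition, not from the thread; the device name matches the `name: Raid` above), you can verify that the array assembled cleanly and is not resyncing:

```bash
# A healthy two-disk mirror shows [UU] and no resync/recovery line
cat /proc/mdstat
sudo mdadm --detail /dev/md/Raid
```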

runiq commented 11 months ago

@Nemric That was actually what I tried originally, and I would absolutely love it if it worked!

Unfortunately, once I replace a disk in the RAID, none of the options you suggest help. Upon reprovisioning, the RAID always gets recreated; Ignition (or rather, libblkid) no longer recognizes the filesystem on top and recreates it, with all data gone.
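For anyone debugging the same thing, this is roughly how to see what libblkid makes of the reassembled array (my suggestion; device name from the configs above):

```bash
# Low-level probe of the array's superblock; if no xfs signature is
# reported here, Ignition treats the device as empty and reformats it
lsblk -f
sudo blkid -p /dev/md/md-var
```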

runiq commented 11 months ago

@dustymabe That appears to have helped with a lot of things, thanks! Unfortunately the core user's home directory is still not there after a reboot. I haven't had much time to dig deeper; I'll try to get a boot log and more info tomorrow.

dustymabe commented 11 months ago

> Unfortunately the core user's home directory is still not there after a reboot.

That's probably because Ignition is what creates that directory (under /var/home), and Ignition runs in the initramfs.
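A quick way to see where the directory actually landed (my sketch; paths are the ones used elsewhere in this thread):

```bash
# If Ignition populated var/ before the RAID-backed /var was mounted, the
# home directory sits under the ostree deploy root instead of /var
ls -ld /sysroot/ostree/deploy/fedora-coreos/var/home/core
ls -ld /var/home/core
```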

runiq commented 11 months ago

Okay, I think I solved this. I'm not getting errors if I do the following dance right after Ignition runs:

  1. Set up a temporary mount point for my own var partition (e.g. /mnt/var)
  2. Punt the systemd journal to /run with journalctl --relinquish-var
  3. Move everything from /sysroot/ostree/deploy/fedora-coreos/var to /mnt/var
  4. Actually mount /var (using var.mount from the OP)
  5. Move the journal back to /var with journalctl --flush

This also works after a reboot, since the ExecCondition check means the RAID only gets created if it cannot be assembled. The journal punt is required to keep the logs of the first boot; otherwise they'd be 'lost' (or rather, written to someplace under /sysroot, probably).

Here is an updated Butane config that incorporates all this:

repro.bu:

```yaml
variant: "fcos"
version: "1.5.0"
boot_device:
  mirror:
    devices:
      - "/dev/disk/by-id/virtio-root-1"
      - "/dev/disk/by-id/virtio-root-2"
systemd:
  units:
    - name: "serial-getty@ttyS0.service"
      dropins:
        - name: "autologin-core.conf"
          contents: |
            [Service]
            ExecStart=
            ExecStart=-/usr/sbin/agetty --autologin core --noclear %I $TERM
            TTYVTDisallocate=no
    - name: "create-var.service"
      enabled: true
      contents: |
        [Unit]
        Description=Create md-var RAID and var filesystem
        DefaultDependencies=no

        # We 'slot' this in between the component devices of the RAID volume
        # and the /var mount:
        After=dev-disk-by\x2dpartlabel-var\x2d1.device
        After=dev-disk-by\x2dpartlabel-var\x2d2.device
        Before=local-fs-pre.target

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        # The RAID itself and the filesystem on it should NOT be able to be assembled for this
        # unit to run
        ExecCondition=/usr/bin/bash -c '! /usr/sbin/mdadm --assemble --scan --name any:md-var'
        # If the array could NOT be assembled, we create it
        ExecStart=/usr/bin/bash -c 'yes | /usr/sbin/mdadm --create md-var \
          --force \
          --run \
          --homehost=any \
          --level=raid1 \
          --raid-devices=2 \
          /dev/disk/by-partlabel/var-1 \
          /dev/disk/by-partlabel/var-2'
        ExecStart=/usr/bin/udevadm settle --exit-if-exists /dev/md/md-var
        # Create filesystem on the newly created RAID
        ExecStart=/usr/sbin/mkfs.xfs -Lvar /dev/md/md-var
        ExecStart=/usr/bin/udevadm settle --exit-if-exists /dev/disk/by-label/var
        # Move stuff over to the new var partition
        ExecStart=/usr/bin/mkdir -p /var/mnt/var
        ExecStart=/usr/bin/mount /dev/disk/by-label/var /var/mnt/var
        ExecStart=/usr/bin/journalctl --relinquish-var
        ExecStart=/usr/bin/bash -c '/usr/bin/find /sysroot/ostree/deploy/fedora-coreos/var -mindepth 1 -maxdepth 1 -print0 | xargs -0 /usr/bin/mv -t /var/mnt/var'
        ExecStart=/usr/bin/sync
        ExecStart=/usr/bin/umount /var/mnt/var

        [Install]
        WantedBy=dev-disk-by\x2dlabel-var.device
    - name: "var.mount"
      enabled: true
      contents: |
        [Unit]
        After=systemd-fsck@dev-disk-by\x2dlabel-var.service
        Requires=systemd-fsck@dev-disk-by\x2dlabel-var.service
        Before=local-fs.target

        [Mount]
        What=/dev/disk/by-label/var
        Where=/var
        Type=xfs
        Options=defaults,strictatime,lazytime,prjquota

        [Install]
        RequiredBy=local-fs.target
    - name: "flush-journald.service"
      enabled: true
      contents: |
        [Unit]
        After=var.mount

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/bin/journalctl --flush

        [Install]
        RequiredBy=var.mount
storage:
  disks:
    - device: "/dev/disk/by-id/virtio-root-1"
      partitions:
        - label: "esp-1"
          wipe_partition_entry: true
        - label: "boot-1"
          wipe_partition_entry: true
        - label: "root-1"
          size_mib: 8400
          wipe_partition_entry: true
        - label: "var-1"
          wipe_partition_entry: false
          type_guid: "A19D880F-05FC-4D3B-A006-743F0F84911E"
    - device: "/dev/disk/by-id/virtio-root-2"
      partitions:
        - label: "esp-2"
          wipe_partition_entry: true
        - label: "boot-2"
          wipe_partition_entry: true
        - label: "root-2"
          size_mib: 8400
          wipe_partition_entry: true
        - label: "var-2"
          wipe_partition_entry: false
          type_guid: "A19D880F-05FC-4D3B-A006-743F0F84911E"
  filesystems:
    - device: "/dev/md/md-boot"
      wipe_filesystem: true
    - device: "/dev/md/md-root"
      wipe_filesystem: true
      format: "xfs"
```

In the end, all this amounts to is a check to see whether we can assemble the RAID instead of unconditionally (re)creating it.
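Distilled into a shell sketch (same logic as `create-var.service` above, minus the journal handling):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Try to reassemble an existing array first; only when that fails do we
# create it from scratch and put a fresh filesystem on it. An assembled
# array keeps its filesystem, and with it the data on /var.
if ! mdadm --assemble --scan --name any:md-var; then
    yes | mdadm --create /dev/md/md-var --force --run \
        --homehost=any --level=raid1 --raid-devices=2 \
        /dev/disk/by-partlabel/var-1 /dev/disk/by-partlabel/var-2
    mkfs.xfs -L var /dev/md/md-var
fi
```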

alcir commented 9 months ago

> Okay, I think I solved this. I'm not getting errors if I do the following dance right after Ignition runs:
>
> 1. Set up a temporary mount point for my own `var` partition (e.g. `/mnt/var`)
> 2. Punt the systemd journal to `/run` with `journalctl --relinquish-var`
> 3. Move everything from `/sysroot/ostree/deploy/fedora-coreos/var` to `/mnt/var`
> 4. _Actually_ mount `/var` (using `var.mount` from the OP)
> 5. Move the journal back to `/var` with `journalctl --flush`

It is a bit of a complicated solution, isn't it?

jlebon commented 9 months ago

I think this should probably be an issue against Ignition to not treat a degraded RAID device the same as no RAID device; e.g., it would have to check whether at least one of the devices in the list is a member of a RAID array with the wanted properties.

runiq commented 9 months ago

The Ignition issue is linked in the OP: https://github.com/coreos/ignition/issues/579