coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Azure: Add predictable symlinks for secondary disks #1165

Open stereobutter opened 2 years ago

stereobutter commented 2 years ago

Similar to https://github.com/coreos/fedora-coreos-tracker/issues/1122, symlinks for secondary disks on Azure are not stable. From what I can gather from trial and error, there are /dev/disk/azure/resource and /dev/disk/azure/root; /dev/disk/azure/root appears to always reference the boot disk correctly, but /dev/disk/azure/resource is randomly assigned to any one of the secondary disks (including the temporary disk that every VM gets).

lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0   32G  0 disk
└─sda1   8:1    0   32G  0 part /var
sdb      8:16   0   32G  0 disk
├─sdb1   8:17   0    1M  0 part
├─sdb2   8:18   0  127M  0 part
├─sdb3   8:19   0  384M  0 part /boot
└─sdb4   8:20   0 31.5G  0 part /sysroot/ostree/deploy/fedora-coreos/var
                                /usr
                                /etc
                                /
                                /sysroot
sdc      8:32   0    7G  0 disk
└─sdc1   8:33   0    7G  0 part

where sda is a secondary disk I attached to the VM (to place /var on it via ignition), sdb is the boot disk, and sdc is the temporary disk attached by Azure.

Looking at the symlinks gives:

ls -la /dev/disk/*/*
lrwxrwxrwx. 1 root root  9 Apr 12 16:59  /dev/disk/azure/resource -> ../../sdc
lrwxrwxrwx. 1 root root 10 Apr 12 16:59  /dev/disk/azure/resource-part1 -> ../../sdc1
lrwxrwxrwx. 1 root root  9 Apr 12 16:59  /dev/disk/azure/root -> ../../sdb
...

What I'd actually like to do is place the user data on another disk via ignition so that I can back up that disk separately, and in case disaster hits I can create a new VM with a snapshot of that disk (using wipe_table: false for that device).

stereobutter commented 2 years ago

I had some success with the udev rules from WALinuxAgent:

storage:
  files:
    - path: /etc/udev/rules.d/66-azure-storage.rules
      mode: 0750
      contents: 
        source: https://raw.githubusercontent.com/Azure/WALinuxAgent/master/config/66-azure-storage.rules
    - path: /etc/udev/rules.d/99-azure-product-uuid.rules
      mode: 0750
      contents: 
        source: https://raw.githubusercontent.com/Azure/WALinuxAgent/master/config/99-azure-product-uuid.rules

These create symlinks in /dev/disk/by-path/ of the following form, where N is the LUN of the disk (example reconstructed from the lun-0 path used below; the target device is illustrative):
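/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-N -> ../../sdX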

Using these I was able to move /var to a data disk using:

  disks:
  - device: /dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0
    wipe_table: false
    partitions:
    - size_mib: 0
      start_mib: 0
      label: user_data
  filesystems:
    - path: /var
      device: /dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0
      format: xfs
      with_mount_unit: true

However, creating a new VM with the same butane file and a copy of the data disk attached fails to boot with:

[   10.200850] ignition[1190]: INFO     : mount: op(2): [started]  mounting "/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0" at "/sysroot/var" with type "xfs" and options ""
[   10.214311] ignition[1190]: DEBUG    : mount: op(2): executing: "mount" "-o" "" "-t" "xfs" "/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0" "/sysroot/var"
[   10.225680] XFS (sdc): Metadata CRC error detected at xfs_sb_read_verify+0x14d/0x170 [xfs], xfs_sb block 0x0 
[   10.232931] XFS (sdc): Unmount and run xfs_repair
[   10.236625] XFS (sdc): First 128 bytes of corrupted metadata buffer:
[   10.241216] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 40 00 00  XFSB.........@..
[   10.246479] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   10.250978] 00000020: 5d 8c 5f 81 54 ef 41 53 ac 76 1a 0e 17 ba c9 76  ]._.T.AS.v.....v
[   10.256190] 00000030: 00 00 00 00 00 20 00 09 00 00 00 00 00 00 00 80  ..... ..........
[   10.262230] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
[   10.268172] 00000050: 00 00 00 01 00 10 00 00 00 00 00 04 00 00 00 00  ................
[   10.279351] 00000060: 00 00 0a 00 bc b5 10 00 02 00 00 08 00 00 00 00  ................
[   10.284456] 00000070: 00 00 00 00 00 00 00 00 0c 0c 09 03 14 00 00 19  ................
[   10.290262] XFS (sdc): SB validate failed with error -74.
[   10.295419] ignition[1190]: CRITICAL : mount: op(2): [failed]   mounting "/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0" at "/sysroot/var" with type "xfs" and options "": exit status 32: Cmd: "mount" "-o" "" "-t" "xfs" "/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0" "/sysroot/var" Stdout: "" Stderr: "mount: /sysroot/var: mount(2) system call failed: Structure needs cleaning.\n"
[FAILED] Failed to start Ignition (mount).
See 'systemctl status ignition-mount.service' for details.

full log: frodo.17de004c-a65a-43fa-94f2-e45882ac5195.serialconsole.txt

stereobutter commented 2 years ago

Disclaimer: please excuse me if the filesystem issue above isn't related to Azure disks and should be its own issue.

Using ext4 instead of xfs in the butane definition for the filesystem, a new VM that gets a data disk from a preexisting snapshot boots successfully without error. However, the disk apparently gets wiped in the process, which I thought wipe_table prevented? I validated that creating an Azure managed disk from my snapshot works by mounting yet another copy of said snapshot into the VM and checking its contents.

Am I doing something wrong with placing /var on another disk and recreating the VM?

bgilbert commented 2 years ago

There's nothing in your config that wipes the disk. wipe_table: false is the default (which just controls the partition table itself) and you haven't set wipe_filesystem: true on the data filesystem. It might be worth checking Azure's handling of the data disk to verify that data is being correctly preserved. You can check whether Ignition is a factor here by removing the partition and filesystem declarations and performing the formatting/mounting by hand.

Is it possible that you created the snapshot before all of the new filesystem metadata was written back to the disk?

stereobutter commented 2 years ago

It might be worth checking Azure's handling of the data disk to verify that data is being correctly preserved

I created a second disk from the same snapshot and mounted that into my VM after ignition ran and the data is there so I don't believe this is the issue.

You can check whether Ignition is a factor here by removing the partition and filesystem declarations and performing the formatting/mounting by hand.

I'm not sure I follow. Removing the partition and filesystem declarations is easy enough but what exactly do you mean by "formatting/mounting by hand"? I assume you mean during the ignition run? Can you give any directions on how to do this? Alternatively are there any logs that ignition leaves behind that could help figure out whether ignition is the culprit?

Is it possible that you created the snapshot before all of the new filesystem metadata was written back to the disk?

After carefully reading https://coreos.github.io/ignition/operator-notes/#filesystem-reuse-semantics and especially the part about matching labels and uuid I explicitly set both label and uuid in my config to rule out that the metadata is the issue.

The following is from a VM that was created with a preexisting filesystem with label data and uuid e1d21e70-080b-4cd2-a51b-4f58496b90fc, on which I put /var (/dev/sdc). After ignition ran I attached another disk also created from the original snapshot (/dev/sdd) and both label and uuid match. (You might have to scroll the output to the right to see the UUID column.)

$ sudo blkid -o list
device                                             fs_type         label            mount point                                            UUID
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/dev/sdd                                           ext4            data             (not mounted)                                          e1d21e70-080b-4cd2-a51b-4f58496b90fc
/dev/sdb1                                          ntfs            Temporary Storage (not mounted)                                         DCF2DA34F2DA131E
/dev/sdc                                           ext4            data             /var                                                   e1d21e70-080b-4cd2-a51b-4f58496b90fc
/dev/sda4                                          xfs             root             /sysroot                                               033e9584-3979-4ec8-a24b-fd0c98651172
/dev/sda2                                          vfat            EFI-SYSTEM       (not mounted)                                          D3AE-F344
/dev/sda3                                          ext4            boot             /boot                                                  bbed36ea-ae69-43c7-862b-6d2fd1a273ec
/dev/sda1

bgilbert commented 2 years ago

Removing the partition and filesystem declarations is easy enough but what exactly do you mean by "formatting/mounting by hand"?

Boot the node, SSH to it, create the partition table and filesystem, mount the filesystem, snapshot the disk, boot another node.
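Concretely, a rough sketch of those manual steps on the node (assuming the data disk shows up as /dev/sdX; the device name, labels, and mount point are illustrative):

# create a GPT with one partition spanning the disk, then format and mount it
printf 'label: gpt\n,,L\n' | sudo sfdisk /dev/sdX
sudo mkfs.ext4 -L data /dev/sdX1
sudo mount /dev/sdX1 /mnt
echo hello | sudo tee /mnt/hello.txt   # write some test data
sudo umount /mnt                       # unmount cleanly before snapshotting the disk in Azure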

Alternatively are there any logs that ignition leaves behind that could help figure out whether ignition is the culprit?

Yes, you can use journalctl -t ignition to see the Ignition logs.
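For example, on the affected node (just one way to narrow the output):

journalctl -t ignition -b | grep -i filesystem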

Is it possible that you created the snapshot before all of the new filesystem metadata was written back to the disk?

After carefully reading https://coreos.github.io/ignition/operator-notes/#filesystem-reuse-semantics and especially the part about matching labels and uuid I explicitly set both label and uuid in my config to rule out that the metadata is the issue.

That doesn't rule it out though. Did you snapshot the VM while it was still running, or after it was properly shut down?

After ignition ran I attached another disk also created from the original snapshot (/dev/sdd) and both label and uuid match.

Are you still seeing the reuse failure in that case?

stereobutter commented 2 years ago

I repeated my experiment once again this morning, and the data disk still seems to get wiped and the filesystem recreated by ignition every time.

Steps to reproduce

  1. Create an Azure managed disk storage
  2. Create a VM frodo from the FCOS image as described in https://docs.fedoraproject.org/en-US/fedora-coreos/provisioning-azure/
    • attach storage as LUN 0
    • pass the ignition file created from the butane config below as Custom data
  3. ssh into frodo and touch hello.txt (in /var/home/core)
  4. systemctl poweroff and wait for VM to be stopped
  5. Create a snapshot snapshot of storage
  6. Create an Azure managed disk copy from snapshot
  7. Create a VM samwise as before
    • attach copy as LUN 0
    • pass again the ignition file created from the butane config below as Custom data
  8. ssh into samwise and observe ls not showing hello.txt
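For reference, the steps above roughly correspond to the following Azure CLI invocations (a sketch; the resource group my-rg, the $FCOS_IMAGE reference, the disk size, and the ./config.ign path are assumptions, and the SSH/poweroff steps happen in between as described):

az disk create -g my-rg -n storage --size-gb 32
az vm create -g my-rg -n frodo --image "$FCOS_IMAGE" --custom-data ./config.ign --attach-data-disks storage
# ssh into frodo, touch hello.txt, systemctl poweroff, wait for the VM to stop
az snapshot create -g my-rg -n storage-snap --source storage
az disk create -g my-rg -n copy --source storage-snap
az vm create -g my-rg -n samwise --image "$FCOS_IMAGE" --custom-data ./config.ign --attach-data-disks copy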

Findings

This time I pulled the logs for frodo and samwise using journalctl -t ignition as you suggested, and maybe I've found the issue. Both logs contain the line

found  filesystem at "/dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0" with uuid "" and label ""

which is probably expected for frodo, as that VM starts out with a fresh/empty disk. For samwise, however, I would have expected ignition to report finding a filesystem with the proper label (data) and uuid (e1d21e70-080b-4cd2-a51b-4f58496b90fc). Is this the reason ignition wipes the disk and recreates the filesystem when it runs on samwise?

When I create another disk sanity_check from snapshot, attach it to samwise as LUN 1, mount it somewhere, and check the filesystem label and UUID using sudo blkid -o list, it reports the correct metadata for that disk. hello.txt is also there, as expected:

device                                  fs_type        label           mount point                                 UUID
--------------------------------------------------------------------------------------------------------------------------------------------------------
/dev/sdd                                ext4           data            /var/home/core/data2                        e1d21e70-080b-4cd2-a51b-4f58496b90fc

butane config

variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      password_hash: $1$KQSW9Uq/$yNAkRIbQvKGKPVdspcjEq0
      ssh_authorized_keys:
        - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDeMUtyZtnfbT6KxQCuC3wgLH06xxlHs1Tvd5o9epuTPA9soEEO0LfLdhv9eDDB0XZ47yrfHMwn3l8ZLWbXA6EQ6W2NQbeZWRC17Xez3fvS9jUG0JKCbonhrZxveKABisbpvnQf3BtMgGRwygVMLG4gTO/goA0Yjy6WJQeQiNATKQbdR1mAtQhDk6BkK1EBA8EYHNvK7JVlsOLtSO4fq8k84ijRIzENO13jBf8z6C3qcaJ/PT49DsrFIz8XBXqs3qSO+0N3wiXp3RRFh0GnpUkVTXKi9bryNoJ+mXcjUJUF6+CyJiqZ41mjxrDq167kDbjrxhwzeLReT+kikCR/6wT91PugXb7JjH1DXMgQADlla3HG7mpo6J5llQc1LZee7Sa0zTdVOMCxuAK/kJSfnlsnPx4tI7qyRYuO/KM+i2uSDWFwa5EAfvKZUnilKt3aW08hylvrN+BwRqiJZ6jVpUZK8oLfHPgU4M/N00edJgTx0L2oyaIb2woBQskFjktDXhMcdlzXoPCEdMsE2dCT1BXrCpBkUyWAxJg32VGQfSn2i2PQx5jM51B5Bl8xxtf5vsogwCcGqOGN5KNVUcxYMtGy99tsLIr/vZCgiqnA3WPXsGv5N5WSP02OtiJ81uLz9UDROSz13bBGJ1lZqhT3IO+1SNb5Ao8Z5777ouap1OWGcw== supersecret
storage:
  disks:
  - device: /dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0
    wipe_table: false
    partitions:
    - size_mib: 0
      start_mib: 0
      label: user_data
  filesystems:
    - path: /var
      device: /dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0
      format: ext4
      with_mount_unit: true
      label: data
      uuid: e1d21e70-080b-4cd2-a51b-4f58496b90fc
  files:
  - path: /etc/udev/rules.d/66-azure-storage.rules
    mode: 0750
    contents: 
      source: https://raw.githubusercontent.com/Azure/WALinuxAgent/master/config/66-azure-storage.rules
  - path: /etc/udev/rules.d/99-azure-product-uuid.rules
    mode: 0750
    contents: 
      source: https://raw.githubusercontent.com/Azure/WALinuxAgent/master/config/99-azure-product-uuid.rules
bgilbert commented 2 years ago

Ahh, I just noticed the problem. Your disks section thinks you're putting the data filesystem on partition 1, but your filesystems section thinks you're putting it directly on an unpartitioned disk. As a result, the partition table is overwriting the start of the filesystem and the filesystem is overwriting the partition table. The fix is to have the filesystems section refer to the partition created in the disks section.
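Applied to the config above, the fix is roughly (a sketch; only the filesystems device changes, pointing at the partition labelled user_data in the disks section):

  disks:
  - device: /dev/disk/by-path/acpi-VMBUS:00-vmbus-f8b3781b1e824818a1c363d806ec15bb-lun-0
    wipe_table: false
    partitions:
    - size_mib: 0
      start_mib: 0
      label: user_data
  filesystems:
    - path: /var
      device: /dev/disk/by-partlabel/user_data
      format: ext4
      with_mount_unit: true
      label: data
      uuid: e1d21e70-080b-4cd2-a51b-4f58496b90fc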

bgilbert commented 2 years ago

Filed https://github.com/coreos/ignition/issues/1397 to automatically generate a warning in this case.

stereobutter commented 2 years ago

Thank you a lot for your patience 🙇 Using /dev/disk/by-partlabel/user_data for the filesystem worked as expected 🥳

If you can point me in the right direction, I'd be happy to try to contribute the WALA udev rules for Azure disks, either in the docs or directly in the FCOS images for Azure. What do you think?

bgilbert commented 2 years ago

Contributions are welcome if you're able to help! Ideally the rules would be included in a Fedora package that we could ship. Failing that, we could consider shipping them directly in fedora-coreos-config. I don't think the docs are a good way to proceed, especially since any udev rules specified via Ignition config don't take effect until after Ignition runs.

dustymabe commented 2 years ago

Regarding udev rules: we already include the WALinuxAgent RPM in FCOS and have some glue to copy the udev rules into the initramfs.

One problem I see is that the RPM is a bit out of date with the latest release, so maybe we just need to get the maintainer to bump things? --> Opened a BZ requesting a new release be built: https://bugzilla.redhat.com/show_bug.cgi?id=2040980

jlebon commented 7 months ago

The nuclear workaround for this is documented for now at https://github.com/openshift/os/blob/master/docs/faq.md#q-how-do-i-configure-a-secondary-block-device-via-ignitionmc-if-the-name-varies-on-each-node.