canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

growpart with zfs fails where vdev is a partition uuid #5788

Open candlerb opened 3 days ago

candlerb commented 3 days ago

Bug report

Given a ZFS image where the rpool vdev is identified as XXXX (referring to /dev/disk/by-partuuid/XXXX), cloud-init's resizefs tries (and fails) to grow /dev/XXXX instead.

Steps to reproduce the problem

The image that I'm booting has ZFS as per the Ubuntu root on ZFS documentation, using partition uuids:

PART3="/dev/disk/by-partuuid/$(sgdisk -i3 "$DISK" | grep "Partition unique GUID" | sed "s/^.*: //" | tr "A-Z" "a-z")"
PART4="/dev/disk/by-partuuid/$(sgdisk -i4 "$DISK" | grep "Partition unique GUID" | sed "s/^.*: //" | tr "A-Z" "a-z")"

zpool create \
...
    bpool "$PART3"

zpool create \
...
    rpool "$PART4"

I use partuuids because I found that /dev/disk/by-id/scsi-SATA_disk1 was not stable across different virtualization environments, causing boot failures. Using the partition uuid, it boots fine everywhere.

The original disk image was 10GiB, which I've resized to 60GiB using qemu-img, and I want cloud-init to grow the root pool into that space.
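
The resize itself was done on the host with something along these lines (the image filename here is illustrative):

qemu-img resize ubuntu-zfs.qcow2 60G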

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bpool  1.88G  46.7M  1.83G        -         -     0%     2%  1.00x    ONLINE  -
rpool     7G  4.57G  2.43G        -         -    39%    65%  1.19x    ONLINE  -

# zpool status rpool
  pool: rpool
 state: ONLINE
config:

    NAME                                    STATE     READ WRITE CKSUM
    rpool                                   ONLINE       0     0     0
      d2250236-f96c-4664-97a2-75369a4c8e47  ONLINE       0     0     0

errors: No known data errors

The vdev device reference there is a partition UUID:

# ls -l /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Oct  6 13:53 10ca2987-3278-427f-8e0a-7b616ebc3689 -> ../../vda5
lrwxrwxrwx 1 root root 10 Oct  6 13:53 2ca6c905-7841-4d1c-b40b-851770c1f9da -> ../../vda3
lrwxrwxrwx 1 root root 10 Oct  6 13:53 505bbfb3-8a15-4e64-a51d-97fe18318c31 -> ../../vda1
lrwxrwxrwx 1 root root 10 Oct  6 13:53 d2250236-f96c-4664-97a2-75369a4c8e47 -> ../../vda4   << THIS ONE
lrwxrwxrwx 1 root root 10 Oct  6 13:53 e97c98ed-bc82-422b-ac07-7d876a1e6a90 -> ../../vda2

vda4 is the last partition on disk, and that's what I want cloud-init to expand.
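
For reference, resolving the symlink directly confirms the mapping:

# readlink -f /dev/disk/by-partuuid/d2250236-f96c-4664-97a2-75369a4c8e47
/dev/vda4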

My cloud-init user data has no growpart config, so it uses the default {'mode': 'auto', 'devices': ['/'], 'ignore_growroot_disabled': False}, which should use all available free space.
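
For reference, the explicit cloud-config equivalent of that default would be something like:

growpart:
  mode: auto
  devices: ["/"]
  ignore_growroot_disabled: false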

The problem is that cloud-init is trying to grow /dev/d2250236-f96c-4664-97a2-75369a4c8e47 instead of /dev/disk/by-partuuid/d2250236-f96c-4664-97a2-75369a4c8e47 (see logs below).

# fdisk -l /dev/vda
GPT PMBR size mismatch (20971519 != 125829119) will be corrected by write.
The backup GPT table is not on the end of the device.
Disk /dev/vda: 60 GiB, 64424509440 bytes, 125829120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: C0E0A64A-3DD3-47E1-9DC5-191159A93244

Device       Start      End  Sectors  Size Type
/dev/vda1     2048  1050623  1048576  512M EFI System
/dev/vda2  1050624  2074623  1024000  500M Linux swap
/dev/vda3  2074624  6268927  4194304    2G Solaris boot
/dev/vda4  6268928 20971486 14702559    7G Solaris root
/dev/vda5       48     2047     2000 1000K BIOS boot

Partition table entries are not in disk order.

Environment details

cloud-init logs

cloud-init-output.log shows:

2024-10-05 11:01:32,430 - cc_resizefs.py[WARNING]: Device '/dev/d2250236-f96c-4664-97a2-75369a4c8e47' did not exist. cannot resize: dev=d2250236-f96c-4664-97a2-75369a4c8e47 mnt_point=/ path=rpool

cloud-init.log shows:

2024-10-05 11:01:32,420 - modules.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) with frequency always
2024-10-05 11:01:32,420 - handlers.py[DEBUG]: start: init-network/config-growpart: running config-growpart with frequency always
2024-10-05 11:01:32,420 - helpers.py[DEBUG]: Running config-growpart using lock (<cloudinit.helpers.DummyLock object at 0x74fd332608c0>)
2024-10-05 11:01:32,420 - cc_growpart.py[DEBUG]: No 'growpart' entry in cfg.  Using default: {'mode': 'auto', 'devices': ['/'], 'ignore_growroot_disabled': False}
2024-10-05 11:01:32,420 - subp.py[DEBUG]: Running command ['growpart', '--help'] with allowed return codes [0] (shell=False, capture=True)
2024-10-05 11:01:32,424 - util.py[DEBUG]: Reading from /proc/946/mountinfo (quiet=False)
2024-10-05 11:01:32,424 - util.py[DEBUG]: Read 2592 bytes from /proc/946/mountinfo
2024-10-05 11:01:32,424 - cc_growpart.py[DEBUG]: growpart found fs=zfs
2024-10-05 11:01:32,424 - util.py[DEBUG]: resize_devices took 0.000 seconds
2024-10-05 11:01:32,424 - cc_growpart.py[DEBUG]: '/' SKIPPED: stat of 'rpool/ROOT/ubuntu_nsrc00' failed: [Errno 2] No such file or directory: 'rpool/ROOT/ubuntu_nsrc00'
2024-10-05 11:01:32,424 - handlers.py[DEBUG]: finish: init-network/config-growpart: SUCCESS: config-growpart ran successfully
2024-10-05 11:01:32,424 - modules.py[DEBUG]: Running module resizefs (<module 'cloudinit.config.cc_resizefs' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_resizefs.py'>) with frequency always
2024-10-05 11:01:32,424 - handlers.py[DEBUG]: start: init-network/config-resizefs: running config-resizefs with frequency always
2024-10-05 11:01:32,425 - helpers.py[DEBUG]: Running config-resizefs using lock (<cloudinit.helpers.DummyLock object at 0x74fd3327c920>)
2024-10-05 11:01:32,425 - util.py[DEBUG]: Reading from /proc/946/mountinfo (quiet=False)
2024-10-05 11:01:32,425 - util.py[DEBUG]: Read 2592 bytes from /proc/946/mountinfo
2024-10-05 11:01:32,425 - subp.py[DEBUG]: Running command ['zpool', 'status', 'rpool'] with allowed return codes [0] (shell=False, capture=True)
2024-10-05 11:01:32,430 - cc_resizefs.py[DEBUG]: found zpool "rpool" on disk d2250236-f96c-4664-97a2-75369a4c8e47
2024-10-05 11:01:32,430 - cc_resizefs.py[DEBUG]: resize_info: dev=d2250236-f96c-4664-97a2-75369a4c8e47 mnt_point=/ path=rpool
2024-10-05 11:01:32,430 - cc_resizefs.py[DEBUG]: 'd2250236-f96c-4664-97a2-75369a4c8e47' doesn't appear to be a valid device path. Trying '/dev/d2250236-f96c-4664-97a2-75369a4c8e47'
2024-10-05 11:01:32,430 - cc_resizefs.py[WARNING]: Device '/dev/d2250236-f96c-4664-97a2-75369a4c8e47' did not exist. cannot resize: dev=d2250236-f96c-4664-97a2-75369a4c8e47 mnt_point=/ path=rpool
2024-10-05 11:01:32,430 - handlers.py[DEBUG]: finish: init-network/config-resizefs: SUCCESS: config-resizefs ran successfully

Workaround

I can grow the partition by hand (and therefore potentially as a runcmd in cloud-init):

# growpart /dev/vda 4
CHANGED: partition=4 start=6268928 old: size=14702559 end=20971486 new: size=119560159 end=125829086
# blockdev --getsize64 /dev/vda4
61214801408

Growing the ZFS vdev requires using the partition uuid as the vdev identifier; referring to it as /dev/vda4 has no effect:

# zpool online -e rpool /dev/vda4
# zpool list rpool
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool     7G  4.57G  2.43G        -         -    40%    65%  1.19x    ONLINE  -
# zpool online -e rpool d2250236-f96c-4664-97a2-75369a4c8e47
# zpool list rpool
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool    57G  4.57G  52.4G        -         -     4%     8%  1.19x    ONLINE  -

Suggestion

Given a vdev device name without a /dev prefix, I think it would be reasonable for cloud-init to try it in turn under the usual /dev/disk/by-* directories (by-partuuid, by-id, by-uuid, by-path) as well as /dev itself; a rough sketch of that lookup is below.
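
Something along these lines (the candidate list and ordering are only a suggestion, using the vdev name from my pool):

vdev=d2250236-f96c-4664-97a2-75369a4c8e47
for dir in /dev /dev/disk/by-partuuid /dev/disk/by-uuid /dev/disk/by-id /dev/disk/by-path; do
    if [ -e "$dir/$vdev" ]; then
        readlink -f "$dir/$vdev"    # prints /dev/vda4 on this system
        break
    fi
done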

There's a simpler answer: have cc_resizefs pass the -P and/or -L flags to zpool status, to get the full vdev path and/or the resolved device name.

# zpool status -P rpool
  pool: rpool
 state: ONLINE
config:

    NAME                                                          STATE     READ WRITE CKSUM
    rpool                                                         ONLINE       0     0     0
      /dev/disk/by-partuuid/d2250236-f96c-4664-97a2-75369a4c8e47  ONLINE       0     0     0

errors: No known data errors

# zpool status -L rpool
  pool: rpool
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      vda4      ONLINE       0     0     0

errors: No known data errors

# zpool status -LP rpool
  pool: rpool
 state: ONLINE
config:

    NAME         STATE     READ WRITE CKSUM
    rpool        ONLINE       0     0     0
      /dev/vda4  ONLINE       0     0     0

errors: No known data errors
candlerb commented 3 days ago

This workaround seems to do the trick:

 runcmd:
  - growpart $(zpool status -LP rpool | awk '/\/dev/ {print $1}' | sed -E 's/([0-9]+)$/ \1/') || true
  - zpool online -e rpool $(zpool status -P rpool | awk '/\/dev/ {print $1}' | sed 's/.*\///') || true

The first command resolves to /dev/vda4, which the sed splits into growpart /dev/vda 4.

The second resolves to the partition uuid that the pool uses as the vdev identifier.
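
Step by step, with the values from this report:

# first runcmd:
zpool status -LP rpool | awk '/\/dev/ {print $1}'
# -> /dev/vda4
zpool status -LP rpool | awk '/\/dev/ {print $1}' | sed -E 's/([0-9]+)$/ \1/'
# -> /dev/vda 4
# i.e. it ends up running: growpart /dev/vda 4

# second runcmd:
zpool status -P rpool | awk '/\/dev/ {print $1}'
# -> /dev/disk/by-partuuid/d2250236-f96c-4664-97a2-75369a4c8e47
zpool status -P rpool | awk '/\/dev/ {print $1}' | sed 's/.*\///'
# -> d2250236-f96c-4664-97a2-75369a4c8e47
# i.e. it ends up running: zpool online -e rpool d2250236-f96c-4664-97a2-75369a4c8e47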

TheRealFalcon commented 1 day ago

Thank you for the detailed bug report and workaround!