canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

disk_setup (and probably fs_setup) does not work with nvme drives #5246

Closed nilo85 closed 5 months ago

nilo85 commented 6 months ago

Bug report

When trying to partition an NVMe SSD, cloud-init seems to assume the wrong naming convention for partitions. It outputs:

Failed during disk check for /dev/nvme0n11

whereas the actual partition name is /dev/nvme0n1p1.

Steps to reproduce the problem

disk_setup:
  "/dev/nvme0n1":
    table_type: gpt
    layout: [[5, 82], [95, 83]]
    overwrite: true

sudo cloud-init single --name disk_setup --frequency always
Cloud-init v. 24.1.3-0ubuntu3 running 'single' at Wed, 01 May 2024 11:22:10 +0000. Up 1916.07 seconds.
2024-05-01 11:22:10,488 - util.py[WARNING]: Failed during filesystem operation
Failed during disk check for /dev/nvme0n11
Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n11', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n11: not a block device
2024-05-01 11:22:10,511 - util.py[WARNING]: Failed during filesystem operation
Failed during disk check for /dev/nvme0n12
Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n12: not a block device

/usr/bin/lsblk --pairs --output NAME,TYPE,FSTYPE,LABEL /dev/nvme0n1
NAME="nvme0n1" TYPE="disk" FSTYPE="" LABEL=""
NAME="nvme0n1p1" TYPE="part" FSTYPE="swap" LABEL=""
NAME="nvme0n1p2" TYPE="part" FSTYPE="" LABEL=""
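
Since lsblk on the parent disk already reports the real partition names, one way to avoid guessing them is to read them from that output. The following is a minimal sketch, assuming the `--pairs` format shown above; `parse_lsblk_pairs` is a hypothetical helper for illustration, not a cloud-init function.

```python
from __future__ import annotations

import shlex


def parse_lsblk_pairs(output: str) -> list[dict[str, str]]:
    """Parse `lsblk --pairs` output into one dict per device line."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # shlex handles the KEY="value" quoting, including empty values.
        fields = shlex.split(line)
        rows.append(dict(field.split("=", 1) for field in fields))
    return rows


# Sample taken from the lsblk output in this report.
sample = '''NAME="nvme0n1" TYPE="disk" FSTYPE="" LABEL=""
NAME="nvme0n1p1" TYPE="part" FSTYPE="swap" LABEL=""
NAME="nvme0n1p2" TYPE="part" FSTYPE="" LABEL=""'''

partitions = [
    "/dev/" + row["NAME"]
    for row in parse_lsblk_pairs(sample)
    if row["TYPE"] == "part"
]
print(partitions)  # ['/dev/nvme0n1p1', '/dev/nvme0n1p2']
```

With this approach the partition paths come from the kernel's own view of the disk, so no naming rule has to be hard-coded.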

Environment details

cloud-init logs

2024-05-01 11:29:57,240 - cc_disk_setup.py[DEBUG]: Creating new filesystem.
2024-05-01 11:29:57,240 - subp.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2024-05-01 11:29:57,249 - cc_disk_setup.py[DEBUG]: Checking /dev/nvme0n1 against default devices
2024-05-01 11:29:57,249 - cc_disk_setup.py[DEBUG]: Manual request of partition 2 for /dev/nvme0n12
2024-05-01 11:29:57,250 - cc_disk_setup.py[DEBUG]: Checking device /dev/nvme0n12
2024-05-01 11:29:57,250 - subp.py[DEBUG]: Running command ['/usr/sbin/blkid', '-c', '/dev/null', '/dev/nvme0n12'] with allowed return codes [0, 2] (shell=False, capture=True)
2024-05-01 11:29:57,251 - cc_disk_setup.py[DEBUG]: Device '/dev/nvme0n12' has check_label='None' check_fstype=None
2024-05-01 11:29:57,251 - cc_disk_setup.py[DEBUG]: Device /dev/nvme0n12 is cleared for formatting
2024-05-01 11:29:57,252 - cc_disk_setup.py[DEBUG]: File system type 'ext4' with label 'data' will be created on /dev/nvme0n12
2024-05-01 11:29:57,252 - subp.py[DEBUG]: Running command ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps'] with allowed return codes [0] (shell=False, capture=True)
2024-05-01 11:29:57,258 - util.py[DEBUG]: Creating fs for /dev/nvme0n1 took 0.018 seconds
2024-05-01 11:29:57,258 - util.py[WARNING]: Failed during filesystem operation
Failed during disk check for /dev/nvme0n12
Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n12: not a block device
2024-05-01 11:29:57,258 - util.py[DEBUG]: Failed during filesystem operation
Failed during disk check for /dev/nvme0n12
Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n12: not a block device
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 272, in enumerate_disk
    info, _err = subp.subp(lsblk_cmd)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 298, in subp
    raise ProcessExecutionError(
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n12: not a block device

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 157, in handle
    util.log_time(
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2827, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 1045, in mkfs
    if overwrite or device_type(device) == "disk":
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 299, in device_type
    for d in enumerate_disk(device, nodeps=True):
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 274, in enumerate_disk
    raise RuntimeError(
RuntimeError: Failed during disk check for /dev/nvme0n12
Unexpected error while running command.
Command: ['/usr/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/nvme0n12', '--nodeps']
Exit code: 32
Reason: -
Stdout:
Stderr: lsblk: /dev/nvme0n12: not a block device
2024-05-01 11:29:57,259 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2024-05-01 11:29:57,259 - util.py[DEBUG]: Read 17 bytes from /proc/uptime
2024-05-01 11:29:57,259 - util.py[DEBUG]: cloud-init mode 'single' took 0.147 seconds (0.14)

dermotbradley commented 6 months ago

NVMe devices are not the only storage devices in Linux that use this different naming scheme for partitions. Other examples are SD cards (e.g. /dev/mmcblk0p1), partitioned loop devices (e.g. /dev/loop0p1) and, I think, software RAID devices (e.g. /dev/md0p1).
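
What all of these examples have in common is that the disk name itself ends in a digit, so Linux inserts a "p" before the partition number to keep the name unambiguous. A minimal sketch of that rule follows; `partition_path` is a hypothetical helper written for illustration, not the code cloud-init uses or the fix that was eventually merged.

```python
def partition_path(disk: str, number: int) -> str:
    """Return the device node for partition `number` of `disk`.

    Assumes the usual Linux convention: when the disk name ends in a
    digit, a "p" separates it from the partition number.
    """
    separator = "p" if disk[-1].isdigit() else ""
    return f"{disk}{separator}{number}"


for disk in ("/dev/sda", "/dev/nvme0n1", "/dev/mmcblk0", "/dev/loop0", "/dev/md0"):
    print(partition_path(disk, 1))
# /dev/sda1
# /dev/nvme0n1p1
# /dev/mmcblk0p1
# /dev/loop0p1
# /dev/md0p1
```

Plain concatenation, which is what produces /dev/nvme0n11 and /dev/nvme0n12 in the logs above, is only correct for names like /dev/sda that end in a letter.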

holmanb commented 6 months ago

@nilo85 thanks for filing this issue! It looks like cloud-init needs to update this code to support various different disk types.

holmanb commented 5 months ago

@nilo85 hey it looks like the config you shared can't invoke the codepath that caused the issue. Can you please provide the full config? I'm guessing that you tried to share just the relevant part and didn't realize that the error came from the fs format failure path.

confirmed: https://github.com/canonical/cloud-init/pull/5263#issuecomment-2110815684

nilo85 commented 5 months ago

This was my config. I later removed the "ssd" alias to rule out the alias as the cause.

device_aliases:
  ssd: /dev/nvme0n1

disk_setup:
  ssd:
    table_type: gpt
    layout: [[5, 82], [95, 83]]

fs_setup:
- label: swap
  device: ssd.1
  filesystem: swap
- label: data
  device: ssd.2
  filesystem: ext4
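
For reference, the log line "Manual request of partition 2 for /dev/nvme0n12" suggests that a device value such as ssd.2 is split into an alias and a partition number, and the alias is then looked up in device_aliases. A rough sketch of that resolution step, under that assumption; `resolve_device` is a hypothetical name and this is not cloud-init's actual implementation:

```python
from __future__ import annotations


def resolve_device(spec: str, aliases: dict[str, str]) -> tuple[str, int | None]:
    """Split '<device>.<partition>' and resolve the device alias.

    Returns the resolved device path and the requested partition number,
    or None when no numeric partition suffix is present.
    """
    name, dot, part = spec.rpartition(".")
    if dot and part.isdigit():
        return aliases.get(name, name), int(part)
    return aliases.get(spec, spec), None


aliases = {"ssd": "/dev/nvme0n1"}
print(resolve_device("ssd.2", aliases))           # ('/dev/nvme0n1', 2)
print(resolve_device("/dev/nvme0n1.1", aliases))  # ('/dev/nvme0n1', 1)
print(resolve_device("ssd", aliases))             # ('/dev/nvme0n1', None)
```

The device path and partition number still have to be combined into a real device node afterwards, and that is the step where the wrong name shows up in the logs.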

I'm pretty sure it is disk_setup, based on this line in the stack trace:

File "/usr/lib/python3/dist-packages/cloudinit/config/cc_disk_setup.py", line 157, in handle
    util.log_time(

I tried modifying the Python code and adding debug output, but quickly realised why I don't code in Python (I couldn't make sense of it) ;)

But it seems those enumerate methods are shared across fs_setup and disk_setup.

I can give it another try in the next few days to get a better understanding of exactly where in the flow it fails.

EDIT: You are probably right. It did create the partition table for me, but no filesystems, so I was probably just assuming that disk_setup was at fault because of the filename (cc_disk_setup.py).

I thought that some disk verification step after creating the layout had failed.