amazonlinux / amazon-ec2-utils

amazon-ec2-utils contains a set of utilities and settings for Linux deployments in EC2
MIT License

ebsnvme-id creates broken sd* symlinks #37

Open martinpitt opened 3 months ago

martinpitt commented 3 months ago

We spent quite some time debugging a storage test regression in Fedora rawhide that breaks scsi_debug and other devices, but only on Red Hat's/Fedora's Testing Farm infrastructure -- which is essentially AWS EC2 machines with an API.

The latest Fedora rawhide instances now have amazon-ec2-utils-2.2.0-2.fc41.noarch, which was introduced into Fedora very recently. It ships /usr/lib/udev/rules.d/70-ec2-nvme-devices.rules with these rules:

KERNEL=="nvme[0-9]*n[0-9]*",        ENV{DEVTYPE}=="disk",      ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/sbin/ebsnvme-id -u /dev/%k", SYMLINK+="%c"
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/sbin/ebsnvme-id -u /dev/%k", SYMLINK+="%c%n"

These instances have an NVMe block device, and these rules cause the following symlinks to be created:

lrwxrwxrwx. 1 root root 7 May 29 03:52 /dev/sda1 -> nvme0n1
lrwxrwxrwx. 1 root root 9 May 29 03:52 /dev/sda11 -> nvme0n1p1
lrwxrwxrwx. 1 root root 9 May 29 03:52 /dev/sda12 -> nvme0n1p2
lrwxrwxrwx. 1 root root 9 May 29 03:52 /dev/sda13 -> nvme0n1p3
lrwxrwxrwx. 1 root root 9 May 29 03:52 /dev/sda14 -> nvme0n1p4

This is problematic in multiple ways:

If a real sda then comes along (e.g. with modprobe scsi_debug), the kernel creates an actual /dev/sda, but it is impossible to create or see partitions on it, because the sda1 etc. names are already taken by the symlinks.
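
A quick way to check whether a machine is affected is to look for symlinks in /dev that carry kernel-style sd* names. The following is only a sketch: `find_sd_symlinks` is a hypothetical helper, not part of amazon-ec2-utils, and the name pattern is a loose approximation of kernel SCSI disk names. It assumes a `find` that supports `-maxdepth` (GNU or BSD).

```shell
#!/bin/sh
# Hypothetical helper: list symlinks directly under a directory (default /dev)
# whose names look like kernel SCSI disk or partition names (sda, sdb1, ...).
# Real kernel block devices are device nodes, not symlinks, so any hit here
# is a name that a later real sd* device or partition could collide with.
find_sd_symlinks() {
    dir=${1:-/dev}
    find "$dir" -maxdepth 1 -type l -name 'sd[a-z]*' | sort
}
```

Running `find_sd_symlinks /dev` on an affected instance should list the sda1, sda11, ... links shown above; on an unaffected one it should print nothing.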

This is most easily reproduced with

# /usr/sbin/ebsnvme-id -u /dev/nvme0n1
sda1

Curiously, it also does that for a partition:

# /usr/sbin/ebsnvme-id -u /dev/nvme0n1p2
sda1

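That explains how the second udev rule can even work: it appends the partition number (%n) to the same "sda1" output, which yields sda11, sda12, and so on -- but this is really hackish!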
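
How the two rules arrive at the names in the listing can be modelled in a few lines of shell. This is a sketch under assumptions, not udev's actual implementation: `expand_symlink` is a hypothetical helper, %c is taken to be the PROGRAM output, and %n the number at the end of the kernel name.

```shell
#!/bin/sh
# Hypothetical helper: model how the two rules build a symlink name.
#   $1 = kernel name (%k), $2 = output of the PROGRAM (%c)
expand_symlink() {
    kernel=$1
    prog_out=$2
    case "$kernel" in
        nvme[0-9]*n[0-9]*p[0-9]*)
            # Partition rule: SYMLINK+="%c%n"
            # Assumption: %n is the number after the final "p".
            num=${kernel##*p}
            printf '%s%s\n' "$prog_out" "$num"
            ;;
        nvme[0-9]*n[0-9]*)
            # Disk rule: SYMLINK+="%c"
            printf '%s\n' "$prog_out"
            ;;
    esac
}

# ebsnvme-id -u printed "sda1" for both the disk and its partitions above.
for dev in nvme0n1 nvme0n1p1 nvme0n1p2; do
    echo "/dev/$(expand_symlink "$dev" sda1) -> $dev"
done
```

Because the program prints the same "sda1" for the disk and for every partition, the partition links only become distinct through the appended number, which is how /dev/sda1 ends up pointing at the whole disk.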

My recommendation as former udev co-upstream is to just entirely remove these rules. They are unhelpful and confusing, and they break stuff. You can of course create symlinks in subdirs of /dev all you like, but please don't collide with kernel names.

Thanks!

martinpitt commented 3 months ago

@mvollmer @major FYI -- @major, do you want me to file this as a Fedora bz, too? This could very well affect other rawhide users/tests, and it has already cost us about 10 hours of our lives.

martinpitt commented 3 months ago

Note: This only affects Fedora rawhide because Testing Farm Fedora 40 instances don't install amazon-ec2-utils by default. When I install it manually, the issue happens there as well.

mvollmer commented 3 months ago

@martinpitt, thanks for filing this! I have a hard time understanding what problem these symlinks are trying to solve. They only seem to create chaos.

If they are supposed to help with giving stable names to NVMe drives, I think that problem is already solved by ID_SERIAL, ID_WWN, and filesystem UUIDs.
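
For reference, the identifiers mentioned here can be read from the udev database. The sketch below assumes `udevadm` is available; `stable_ids` and `show_stable_ids` are hypothetical helper names, and ID_FS_UUID is assumed to be the property that carries the filesystem UUID.

```shell
#!/bin/sh
# Hypothetical helper: keep only the stable identifiers from
# KEY=VALUE udev property output read on stdin.
stable_ids() {
    grep -E '^(ID_SERIAL|ID_WWN|ID_FS_UUID)='
}

# Hypothetical helper: query the udev database for one device node
# and print only its stable identifiers.
show_stable_ids() {
    udevadm info --query=property --name="$1" | stable_ids
}
```

For example, `show_stable_ids /dev/nvme0n1` prints whichever of these properties udev has recorded for the disk. The same identifiers normally back the links under /dev/disk/by-id and /dev/disk/by-uuid, which live in subdirectories and so cannot collide with kernel names.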

martinpitt commented 3 months ago

https://gitlab.com/testing-farm/infrastructure doesn't actually install that package -- I figure it's now part of the official Fedora rawhide AMIs?

major commented 3 months ago

> @mvollmer @major FYI -- @major, do you want me to file this as a Fedora bz, too? This could very well affect other rawhide users/tests, and it has already cost us about 10 hours of our lives..

@martinpitt That would be helpful. Thanks for detailing the problems you found. I missed these during testing!

martinpitt commented 3 months ago

@major OK, I filed https://bugzilla.redhat.com/show_bug.cgi?id=2284397 . Thanks!