canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

net.get_interfaces() raises FileNotFoundError when network interface is renamed during enumeration #5108

Open cjp256 opened 5 months ago

cjp256 commented 5 months ago

device_driver() & get_interfaces() may need some work to be resilient if an interface is being renamed while enumerating devices:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 860, in _get_data
    crawled_data = util.log_time(
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2808, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 59, in impl
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 639, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 59, in impl
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 424, in _setup_ephemeral_networking
    % (iface, net.get_interfaces()),
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 1079, in get_interfaces
    driver = device_driver(name)
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 364, in device_driver
    driver = os.path.basename(os.readlink(driver_path))
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/net/ib0/device/driver'
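
The traceback ends in os.readlink() on a sysfs path that vanished between enumeration and lookup. One way to tolerate that is sketched below: treat a missing driver symlink as "driver unknown" instead of letting the exception propagate. The name device_driver_safe and its sys_class_net parameter are hypothetical and are not part of cloud-init; this is an illustration of the idea, not the project's fix.

```python
import os
from typing import Optional


def device_driver_safe(
    devname: str, sys_class_net: str = "/sys/class/net"
) -> Optional[str]:
    """Return the driver name for devname, or None if it cannot be read.

    Hypothetical helper: the sysfs entry may disappear at any moment if
    the interface is renamed, so every failure to read the symlink is
    treated as "unknown driver".
    """
    driver_path = os.path.join(sys_class_net, devname, "device", "driver")
    try:
        # "driver" is expected to be a symlink to the driver directory.
        return os.path.basename(os.readlink(driver_path))
    except OSError:
        # Covers FileNotFoundError (entry gone or renamed) and the case
        # where the path exists but is not a symlink.
        return None
```

A caller would still have to decide what to do with an interface whose driver is None, for example skipping it or re-enumerating.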

Around the same time we can see the interfaces are being renamed:

[   36.915981] mlx5_core 0101:00:00.0 ibP257s165943: renamed from ib0
[   36.953812] mlx5_core 0102:00:00.0 ibP258s165981: renamed from ib0
[   37.057541] mlx5_core 0103:00:00.0 ibP259s166052: renamed from ib0
[   37.104314] mlx5_core 0104:00:00.0 ibP260s166131: renamed from ib0
[   37.157170] mlx5_core 0105:00:00.0 ibP261s166145: renamed from ib0
[   37.208995] mlx5_core 0106:00:00.0 ibP262s166254: renamed from ib0
[   37.240758] mlx5_core 0107:00:00.0 ibP263s166326: renamed from ib0
[   37.276640] mlx5_core 0108:00:00.0 ibP264s166393: renamed from ib0

holmanb commented 5 months ago

> device_driver() & get_interfaces() may need some work to be resilient if an interface is being renamed while enumerating devices:

Agreed, but I think the problem is bigger than these two callsites.

A retry in get_interfaces() on FileNotFoundError would prevent this exception. Even with that change, though, if get_interfaces() gets called just prior to a rename, we're going to see the same issue elsewhere, such as in one of the read_sys_net_safe() calls in find_candidate_nics_on_linux(). That would probably be more difficult to debug because it wouldn't produce a traceback. Anywhere that expects to use the returned interface name may also see this issue.
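
For illustration, the retry idea could look roughly like the sketch below. The helper retry_on_missing and its attempts and delay parameters are hypothetical, not cloud-init API. As noted above, this only narrows the window at one call site and does nothing for callers that use a name after it has been returned.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_missing(
    func: Callable[[], T], attempts: int = 3, delay: float = 0.1
) -> T:
    """Call func(), retrying when it raises FileNotFoundError.

    Hypothetical wrapper: re-raises the last FileNotFoundError once
    the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except FileNotFoundError:
            if attempt == attempts:
                raise
            # Give a possible rename a moment to finish before retrying.
            time.sleep(delay)
    raise ValueError("attempts must be >= 1")
```

Assumed usage would be something like retry_on_missing(net.get_interfaces), with the attempt count and delay chosen for the platform.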

Do you know if the rename is triggered by userspace? Also, what distro/release did you see this on?

cjp256 commented 5 months ago

It's udev renaming them from the generic ib0 -> ibP257s165943 and so on. The devices are just being enumerated late in the process for whatever reason. We can tell they are still being enumerated because each one shows up as ib0 before getting renamed.
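
Since the renames arrive over a short period, one heuristic a caller might try is to wait until the set of interface names stops changing before acting on it. The sketch below polls a name listing until two consecutive snapshots match. The function wait_for_stable_names and its parameters are hypothetical, and the approach is an assumption for illustration: a device that appears after the wait returns would still race.

```python
import os
import time
from typing import Callable, FrozenSet, Iterable


def wait_for_stable_names(
    list_names: Callable[[], Iterable[str]] = lambda: os.listdir("/sys/class/net"),
    interval: float = 0.2,
    timeout: float = 5.0,
) -> FrozenSet[str]:
    """Poll until two consecutive snapshots of interface names match.

    Hypothetical helper: returns the stable snapshot, or the most
    recent snapshot if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    previous = frozenset(list_names())
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = frozenset(list_names())
        if current == previous:
            return current
        # Names changed (for example ib0 was renamed); keep waiting.
        previous = current
    return previous
```

The list_names parameter is injectable so the polling logic can be exercised without touching sysfs.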