canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Unable to find a system nic while running cloud-init #4125

Open xiaoge1001 opened 1 year ago

xiaoge1001 commented 1 year ago

Bug report

When I run "cloud-init status", I find that cloud-init failed to execute.

Steps to reproduce the problem

Reboot the compute nodes.

Environment details

cloud-init logs

2023-05-18 02:48:15,451 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 689, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 398, in main_init
    init.apply_network_config(bring_up=bring_up_interfaces)
  File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 789, in apply_network_config
    netcfg, src = self._find_networking_config()
  File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 740, in _find_networking_config
    if self.datasource and hasattr(self.datasource, 'network_config'):
  File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceConfigDrive.py", line 158, in network_config
    self._network_config = openstack.convert_net_json(
  File "/usr/lib/python3.9/site-packages/cloudinit/sources/helpers/openstack.py", line 698, in convert_net_json
    raise ValueError("Unable to find a system nic for %s" % d)
ValueError: Unable to find a system nic for {'mtu': 1500, 'type': 'physical', 'subnets': [{'type': 'static', 'netmask': '255.255.255.0', 'routes': [{'netmask': '0.0.0.0', 'network': '0.0.0.0', 'gateway': '192.168.50.1'}], 'address': '192.168.50.19', 'ipv4': True}], 'mac_address': 'fa:16:3e:7c:49:9f'}

xiaoge1001 commented 1 year ago

Same question: https://askubuntu.com/questions/1400527/unable-to-find-a-system-nic-while-running-cloud-init

xiaoge1001 commented 1 year ago

I added "ExecStartPre=sleep 30" to the Service section of cloud-init-local.service. This avoids the problem, but I don't think it is a fundamental solution.

radeksm commented 1 year ago

I see a similar problem on 22.1-5.el8.0.1, which is the version that comes with RHEL 8.6:

May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd-hostnamed[7896]: Changed static host name to 'overcloud-controller-cbis24-375-0'
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd-hostnamed[7896]: Changed host name to 'overcloud-controller-cbis24-375-0'
May 18 12:06:41 overcloud-controller-cbis24-375-0 ovs-vsctl[7898]: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: Cloud-init v. 22.1-5.el8.0.1 running 'init-local' at Thu, 18 May 2023 16:06:40 +0000. Up 91.27 seconds.
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: 2023-05-18 16:06:41,214 - util.py[WARNING]: failed stage init-local
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: failed run of stage init-local
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ------------------------------------------------------------
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: Traceback (most recent call last):
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/cmd/main.py", line 761, in status_wrapper
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ret = functor(name, args)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/cmd/main.py", line 433, in main_init
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: init.apply_network_config(bring_up=bring_up_interfaces)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 869, in apply_network_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: netcfg, src = self._find_networking_config()
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 812, in _find_networking_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: if self.datasource and hasattr(self.datasource, "network_config"):
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/sources/DataSourceConfigDrive.py", line 169, in network_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: self.network_json, known_macs=self.known_macs
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/sources/helpers/openstack.py", line 731, in convert_net_json
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: raise ValueError("Unable to find a system nic for %s" % d)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ValueError: Unable to find a system nic for {'type': 'physical', 'mtu': 9000, 'subnets': [{'type': 'dhcp4'}], 'mac_addre>
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ------------------------------------------------------------
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=1/FAILURE
May 18 12:06:41 overcloud-controller-cbis24-375-0 dbus-daemon[7630]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.>
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
May 18 12:06:41 overcloud-controller-cbis24-375-0 dbus-daemon[7630]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'

I do see the interface with the expected MAC address later, so my guess is that the NIC with that MAC address is not ready, or does not exist yet, at the time cloud-init tries to configure it. This is likely to happen when you have SR-IOV NICs with a large number of virtual functions.
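One way to test this guess is to record which MAC addresses are visible in sysfs at the moment of failure. The following is a minimal sketch, not cloud-init code: the helper names `system_macs` and `missing_macs` are made up for illustration, and it assumes the standard Linux layout where each interface exposes its MAC in `/sys/class/net/<name>/address`.

```python
import os


def system_macs(sys_class_net="/sys/class/net"):
    """Map MAC address -> interface name for NICs currently visible in sysfs.

    Hypothetical helper for debugging; not part of cloud-init.
    """
    macs = {}
    if not os.path.isdir(sys_class_net):
        return macs
    for name in sorted(os.listdir(sys_class_net)):
        address_file = os.path.join(sys_class_net, name, "address")
        try:
            with open(address_file) as f:
                mac = f.read().strip().lower()
        except OSError:
            # Not an interface directory, or the interface just went away.
            continue
        if mac and mac != "00:00:00:00:00:00":
            macs[mac] = name
    return macs


def missing_macs(expected, sys_class_net="/sys/class/net"):
    """Return the expected MACs that are not (yet) present on the system."""
    present = system_macs(sys_class_net)
    return [mac for mac in expected if mac.lower() not in present]


if __name__ == "__main__":
    # MAC taken from the traceback in the original report.
    print(missing_macs(["fa:16:3e:7c:49:9f"]))
```

Running this from a unit ordered next to cloud-init-local.service, and again after boot has finished, would show whether the interface simply appeared late.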

aciba90 commented 1 year ago

Thanks @xiaoge1001 for reporting this issue.

I am not able to reproduce this exact behavior. Could you please provide steps to reproduce and attach the logs?

In the meantime, I am marking this issue as incomplete.

xiaoge1001 commented 6 months ago

Sorry, this is an occasional problem and I cannot reproduce it on demand. At the time, I adopted the suggestion from this link: https://askubuntu.com/questions/1400527/unable-to-find-a-system-nic-while-running-cloud-init

holmanb commented 5 months ago

This is a duplicate of https://github.com/canonical/cloud-init/issues/3523.

holmanb commented 5 months ago

Cloud-init doesn't hit this codepath until after systemd-networkd-wait-online.service (and the NetworkManager equivalent). While this guarantees that an interface will be ready in userspace, it does not guarantee that all expected interfaces will be ready in userspace. Interfaces may be loaded late by the kernel when the driver is not loaded by the initramfs. This is what causes this error.

This exception apparently exists to check that invalid network configuration isn't passed to cloud-init. It gets thrown when rendering a configuration for OpenStack that contains a network device which hasn't been loaded yet.
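The check behind the exception can be pictured roughly as follows. This is a simplified sketch and not the actual body of `convert_net_json`: the function name `find_system_nic` is invented, and it assumes `known_macs` is a mapping from lowercase MAC address to interface name, built from the interfaces visible at that moment.

```python
def find_system_nic(link, known_macs):
    """Return the interface name for a link, or raise if a physical NIC is absent.

    Simplified illustration only; `known_macs` is assumed to map
    MAC address -> interface name.
    """
    mac = link.get("mac_address")
    name = known_macs.get(mac.lower()) if mac else None
    if link.get("type") == "physical" and name is None:
        # Same message as in the tracebacks reported above.
        raise ValueError("Unable to find a system nic for %s" % link)
    return name
```

If the driver for a NIC loads after `known_macs` is collected, a perfectly valid link entry takes the `raise` branch.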

When an interface is missing, cloud-init has a few options:

a) assume (incorrectly) that if the device is not yet visible in userspace, writing out a configuration with this device will break network backends, and raise a traceback without configuring the network at all.

Pros: cloud-init can log warnings when openstack passes invalid configuration
Cons: breaks perfectly valid configurations (as reported multiple times).

b) warn about the device not existing and drop it from the configuration (previously proposed)

Pros: breaks perfectly valid configurations slightly less than current state by rendering network configuration for all interfaces that are currently up
Cons: still breaks valid configurations that include late-loaded interfaces

c) block until the interface that openstack told us about is available

Pros: will always "work" when valid configuration is passed
Cons: worst case failure path with invalid configuration (instance hangs), unnecessarily slow boot

d) trust that network daemons can handle configurations which reference interfaces which might load late[1]: don't throw an exception, don't log an error, but do log an info/debug about the missing interface in case openstack did pass invalid network configuration

Pros: will always "work" when valid configuration is passed
Cons: no warning logged or exception thrown when invalid configuration is received from openstack
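To make the difference between a), b) and d) concrete, here is a rough sketch of the three non-blocking behaviors as a single policy switch. None of this is existing cloud-init API: `MissingNicPolicy` and `select_links` are hypothetical names, and `known_macs` is again assumed to map MAC address to interface name.

```python
import logging
from enum import Enum

LOG = logging.getLogger(__name__)


class MissingNicPolicy(Enum):
    FAIL = "a"   # current behavior: raise and configure nothing
    DROP = "b"   # warn and drop the missing device
    TRUST = "d"  # keep the device, log at debug level only


def select_links(links, known_macs, policy):
    """Return the links that would be rendered under the given policy."""
    selected = []
    for link in links:
        mac = link.get("mac_address", "").lower()
        if link.get("type") != "physical" or mac in known_macs:
            selected.append(link)
            continue
        if policy is MissingNicPolicy.FAIL:
            raise ValueError("Unable to find a system nic for %s" % link)
        if policy is MissingNicPolicy.DROP:
            LOG.warning("Dropping config for missing nic %s", mac)
            continue
        # TRUST: leave it to the network daemon to handle a late interface.
        LOG.debug("nic %s not visible yet; rendering its config anyway", mac)
        selected.append(link)
    return selected
```

With one present and one late-loading NIC, `FAIL` renders nothing, `DROP` renders only the present NIC, and `TRUST` renders both.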

There is no clear "best" choice; this is an issue with engineering tradeoffs to consider.

Cloud-init currently does a). The tradeoffs are between behaving correctly, affecting boot speed, failure path behavior, and failure path ease of debugging.

If the priority is to work correctly when correct configuration is passed, then c) or d) are probably superior. My preference is d), for better failure path behavior and no negative impact on boot speed.
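For comparison, option c) amounts to polling until every expected MAC shows up. The sketch below adds a timeout, which is an assumption of this example and not part of the option as described; without one, invalid configuration would hang the instance. `wait_for_macs` and its parameters are hypothetical, and `get_present` stands for any callable returning the MACs currently visible.

```python
import time


def wait_for_macs(expected, get_present, timeout=30.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until all expected MACs are present or the timeout expires.

    Returns the set of MACs still missing (empty on success).
    """
    deadline = clock() + timeout
    missing = {mac.lower() for mac in expected}
    while True:
        missing -= {mac.lower() for mac in get_present()}
        if not missing or clock() >= deadline:
            return missing
        sleep(interval)
```

Every iteration spent in this loop is added to boot time, which is the cost that d) avoids.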

If the priority is logging noisily when invalid network configuration is passed (which would then require users to modify the kernel with built-in drivers or modify the initrd to load drivers), then a) or b) are probably superior.

I haven't investigated when openstack might pass invalid configuration. Can users provide free-form network configuration, or is this a machine-generated configuration that is generated either dynamically or via structured user input forms? If users can't pass free-form configuration, then a) and b) would seem significantly less practical. Even if they can, I think that I would prefer to do the right thing if possible rather than maintain the ability to warn but break on false positives.

[1] They must: interfaces load late all the time on many platforms. Common offenders include virtio and high-speed network adapters.