canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.99k stars 883 forks source link

hotplug: high latency when doing driver reload #5706

Open aciba90 opened 2 months ago

aciba90 commented 2 months ago

Bug report

cloud-init's networking hotpug functionality waits and retries unnecessarily if one reloads a NIC's driver.

If one executes modprobe --remove ena then the kernel does trigger an udev remove event on the NICs using that driver and cloud-init's hotplug reacts to it:

2024-09-16 11:44:53,003 - hotplug_hook.py[DEBUG]: hotplug-hook called with the following arguments: {hotplug_action: handle, subsystem: net, udevaction: remove, devpath: /devices/pci0000:00/0000:00:05.0/net/ens5}

This does instruct cloud-init to re-fetch the datasource information and re-configure the network. But, as per AWS and the IMDS those NICs are still present in the IMDS and thus, cloud-init does not remove them. Then, the hotplug functionality realises that (that the udev-removed NIC is still present in the network configuration) and keeps waiting and retrying for a while.

Additional info

SF: #395673

Steps to reproduce the problem

  1. Launch an ec2 instance
  2. Execute modprobe --remove ena && modprobe ena
  3. Check cloud-init logs and see how the udev remove handler waits and retries for minute or so.

Alternatively, apply this patch and execute:

diff --git a/tests/integration_tests/modules/test_hotplug.py b/tests/integration_tests/modules/test_hotplug.py
index a41b2313c..f9a6266d6 100644
--- a/tests/integration_tests/modules/test_hotplug.py
+++ b/tests/integration_tests/modules/test_hotplug.py
@@ -545,3 +545,18 @@ def test_nics_before_config_trigger_hotplug(client: IntegrationInstance):

     client.instance.remove_network_interface(added_ip_0)
     _wait_till_hotplug_complete(client, expected_runs=2)
+
+
+@pytest.mark.skipif(PLATFORM != "ec2", reason="test is ec2 specific")
+@pytest.mark.user_data(USER_DATA)
+def test_modprobe(client: IntegrationInstance):
+    client.execute("modprobe -r ena && modprobe ena", use_sudo=True)
+    _wait_till_hotplug_complete(client, expected_runs=1)
+    log_content = client.read_from_file("/var/log/cloud-init.log")
+    verify_clean_log(log_content)
+    verify_clean_boot(client)

Environment details

cloud-init logs

cloud-init(2).tar.gz