canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

get_interfaces_by_mac_on_linux: RuntimeError: duplicate mac found (driver: mlx5_core) #5794

Open jeremy-oracle opened 1 month ago

jeremy-oracle commented 1 month ago

Bug report

We are creating an instance with multiple network interfaces with the same MAC address on purpose because they are part of the same SR-IOV bond, but cloud-init code throws an exception.

Steps to reproduce the problem

Create an instance connected with 2 or more Mellanox CX5 or CX6 SR-IOV virtual functions with the same MAC address. The driver is mlx5_core.
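To confirm which driver backs each interface on an affected instance, one option is to read the `device/driver` symlink under `/sys/class/net/<name>`. The sketch below is a minimal illustration; `driver_for` and its `sys_class_net` parameter are hypothetical names chosen for this example and are not part of cloud-init.

```python
import os


def driver_for(ifname, sys_class_net="/sys/class/net"):
    """Return the kernel driver name bound to ifname, or None.

    Hypothetical helper, not a cloud-init API. On Linux the symlink
    <sys_class_net>/<ifname>/device/driver points at the driver directory,
    so its basename is the driver name (for example "mlx5_core").
    """
    link = os.path.join(sys_class_net, ifname, "device", "driver")
    if not os.path.islink(link):
        # Purely virtual devices such as lo have no device/driver link.
        return None
    return os.path.basename(os.readlink(link))
```

Calling `driver_for("ens5")` on the instance described here would be expected to return `"mlx5_core"`, per the report above.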

Environment details

cloud-init logs

[   27.864643] cloud-init[1535]: Cloud-init v. 23.4-7.0.1.el9_4.3 running 'init-local' at Fri, 04 Oct 2024 21:57:35 +0000. Up 27.84 seconds.
[   27.932817] cloud-init[1535]: 2024-10-04 21:57:35,819 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   27.937519] cloud-init[1535]: 2024-10-04 21:57:35,824 - util.py[WARNING]: failed stage init-local
[   27.941049] cloud-init[1535]: failed run of stage init-local
[   27.941601] cloud-init[1535]: ------------------------------------------------------------
[   27.943182] cloud-init[1535]: Traceback (most recent call last):
[   27.943841] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   27.945303] cloud-init[1535]:     ret = functor(name, args)
[   27.948436] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   27.949544] cloud-init[1535]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   27.950535] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   27.951721] cloud-init[1535]:     netcfg, src = self._find_networking_config()
[   27.952608] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   27.954105] cloud-init[1535]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   27.955260] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   27.956327] cloud-init[1535]:     _ensure_netfailover_safe(self._network_config)
[   27.956994] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   27.958135] cloud-init[1535]:     mac_to_name = get_interfaces_by_mac()
[   27.958751] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   27.959788] cloud-init[1535]:     return get_interfaces_by_mac_on_linux()
[   27.960406] cloud-init[1535]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   27.961503] cloud-init[1535]:     raise RuntimeError(msg)
[   27.962022] cloud-init[1535]: RuntimeError: duplicate mac found! both 'ens6' and 'ens5' have mac '00:13:97:6f:3d:9f'.
[   27.962936] cloud-init[1535]: ------------------------------------------------------------

We want those ens5 and ens6 Mellanox SR-IOV virtual function interfaces to be ignored, because a custom script will configure the bonding. What would be the best solution for this within the cloud-init framework?

The workaround today is to attach those SR-IOV interfaces only after the first boot; the problem occurs whenever they are attached at first boot.

Also, I couldn't run collect-logs because I couldn't log in to the instance since the cloud-init process was stopped by this problem.

Thank you, Jeremy

TheRealFalcon commented 1 month ago

> Create an instance connected with 2 or more Mellanox CX5 or CX6 SR-IOV virtual functions with the same MAC address. The driver is mlx5_core.

Can you provide more information on how to do this? Is there a way I can specify an SR-IOV bonded device at launch time? Is it inherent to a certain instance shape? If you use specific CLI launch args or options in the web interface, that would be helpful.

> what would be the best solution for this within the cloud-init framework?

We currently work around these types of devices in cloud-init on other platforms. We would need to adapt similar code for Oracle's platform.

> I couldn't run collect-logs because I couldn't log in to the instance since the cloud-init process was stopped by this problem.

It'd be very helpful to get access to /var/log/cloud-init.log. Is the serial console an option?

jeremy-oracle commented 1 month ago

In terms of reproducing on your side, this is running on a PCA (Private Cloud Appliance) for development, so it is not yet broadly available. A PCA is basically a mini OCI in a rack that customers can purchase to run on-premises with an OCI-compatible API.

Looking at the serial console, we have this during init-local:

[   45.708569] cloud-init[1544]: 2024-10-07 21:49:10,625 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   45.711427] cloud-init[1544]: 2024-10-07 21:49:10,628 - util.py[WARNING]: failed stage init-local
[   45.712641] cloud-init[1544]: failed run of stage init-local
[   45.713359] cloud-init[1544]: ------------------------------------------------------------
[   45.714347] cloud-init[1544]: Traceback (most recent call last):
[   45.715086] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   45.716336] cloud-init[1544]:     ret = functor(name, args)
[   45.717017] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   45.718226] cloud-init[1544]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   45.719183] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   45.720464] cloud-init[1544]:     netcfg, src = self._find_networking_config()
[   45.721320] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   45.722633] cloud-init[1544]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   45.723712] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   45.725106] cloud-init[1544]:     _ensure_netfailover_safe(self._network_config)
[   45.725982] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   45.727467] cloud-init[1544]:     mac_to_name = get_interfaces_by_mac()
[   45.728246] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   45.729576] cloud-init[1544]:     return get_interfaces_by_mac_on_linux()
[   45.730364] cloud-init[1544]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   45.731812] cloud-init[1544]:     raise RuntimeError(msg)
[   45.732467] cloud-init[1544]: RuntimeError: duplicate mac found! both 'ens7' and 'ens8' have mac '00:13:97:87:a1:47'.
[   45.733673] cloud-init[1544]: ------------------------------------------------------------

and later this:

[   76.141036] cloud-init[2164]: Cloud-init v. 23.4-7.0.1.el9_4.3 running 'init' at Mon, 07 Oct 2024 21:49:41 +0000. Up 76.12 seconds.
[   76.155407] cloud-init[2164]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
[   76.156448] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.157451] cloud-init[2164]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
[   76.158443] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.159424] cloud-init[2164]: ci-info: |  ens3  | True |         192.168.0.3         | 255.255.255.0 | global | 00:13:97:0e:a9:93 |
[   76.160404] cloud-init[2164]: ci-info: |  ens3  | True | fe80::213:97ff:fe0e:a993/64 |       .       |  link  | 00:13:97:0e:a9:93 |
[   76.161389] cloud-init[2164]: ci-info: |  ens5  | True |              .              |       .       |   .    | 00:13:97:44:d5:fd |
[   76.162373] cloud-init[2164]: ci-info: |  ens6  | True |              .              |       .       |   .    | 00:13:97:44:d5:fd |
[   76.163374] cloud-init[2164]: ci-info: |  ens7  | True |              .              |       .       |   .    | 00:13:97:87:a1:47 |
[   76.164361] cloud-init[2164]: ci-info: |  ens8  | True |              .              |       .       |   .    | 00:13:97:87:a1:47 |
[   76.165343] cloud-init[2164]: ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
[   76.166333] cloud-init[2164]: ci-info: |   lo   | True |           ::1/128           |       .       |  host  |         .         |
[   76.167313] cloud-init[2164]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[   76.168295] cloud-init[2164]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
[   76.169156] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.170016] cloud-init[2164]: ci-info: | Route | Destination |   Gateway   |    Genmask    | Interface | Flags |
[   76.170871] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.171723] cloud-init[2164]: ci-info: |   0   |   0.0.0.0   | 192.168.0.1 |    0.0.0.0    |    ens3   |   UG  |
[   76.172581] cloud-init[2164]: ci-info: |   1   | 192.168.0.0 |   0.0.0.0   | 255.255.255.0 |    ens3   |   U   |
[   76.173438] cloud-init[2164]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   76.174290] cloud-init[2164]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   76.175009] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.175743] cloud-init[2164]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   76.176467] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.177189] cloud-init[2164]: ci-info: |   1   |  fe80::/64  |    ::   |    ens3   |   U   |
[   76.177899] cloud-init[2164]: ci-info: |   3   |    local    |    ::   |    ens3   |   U   |
[   76.178618] cloud-init[2164]: ci-info: |   4   |  multicast  |    ::   |    ens3   |   U   |
[   76.179342] cloud-init[2164]: ci-info: |   5   |  multicast  |    ::   |    ens5   |   U   |
[   76.180060] cloud-init[2164]: ci-info: |   6   |  multicast  |    ::   |    ens6   |   U   |
[   76.180789] cloud-init[2164]: ci-info: |   7   |  multicast  |    ::   |    ens7   |   U   |
[   76.181506] cloud-init[2164]: ci-info: |   8   |  multicast  |    ::   |    ens8   |   U   |
[   76.182228] cloud-init[2164]: ci-info: +-------+-------------+---------+-----------+-------+
[   76.224201] cloud-init[2164]: 2024-10-07 21:49:41,141 - util.py[WARNING]: Failed to parse IMDS network configuration!
[   76.227326] cloud-init[2164]: 2024-10-07 21:49:41,144 - util.py[WARNING]: failed stage init
[   76.228539] cloud-init[2164]: failed run of stage init
[   76.229014] cloud-init[2164]: ------------------------------------------------------------
[   76.229726] cloud-init[2164]: Traceback (most recent call last):
[   76.230284] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
[   76.231208] cloud-init[2164]:     ret = functor(name, args)
[   76.231680] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 442, in main_init
[   76.232578] cloud-init[2164]:     init.apply_network_config(bring_up=bring_up_interfaces)
[   76.233276] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 997, in apply_network_config
[   76.234230] cloud-init[2164]:     netcfg, src = self._find_networking_config()
[   76.234843] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 936, in _find_networking_config
[   76.235810] cloud-init[2164]:     if self.datasource and hasattr(self.datasource, "network_config"):
[   76.236573] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 273, in network_config
[   76.237595] cloud-init[2164]:     _ensure_netfailover_safe(self._network_config)
[   76.238226] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceOracle.py", line 96, in _ensure_netfailover_safe
[   76.239319] cloud-init[2164]:     mac_to_name = get_interfaces_by_mac()
[   76.239873] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 897, in get_interfaces_by_mac
[   76.240866] cloud-init[2164]:     return get_interfaces_by_mac_on_linux()
[   76.241438] cloud-init[2164]:   File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 996, in get_interfaces_by_mac_on_linux
[   76.242507] cloud-init[2164]:     raise RuntimeError(msg)
[   76.242970] cloud-init[2164]: RuntimeError: duplicate mac found! both 'ens7' and 'ens8' have mac '00:13:97:87:a1:47'.
[   76.243868] cloud-init[2164]: ------------------------------------------------------------
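The Net device info table in the log above shows two pairs of interfaces sharing a hardware address. As a small illustration of how such pairs can be picked out programmatically, the following sketch groups interface names by MAC. The function names are made up for this example, and the input dictionary is copied from the Hw-Address column of the log.

```python
from collections import defaultdict


def group_by_mac(iface_to_mac):
    """Map each MAC to the sorted list of interface names that carry it."""
    groups = defaultdict(list)
    for name, mac in iface_to_mac.items():
        groups[mac.lower()].append(name)
    return {mac: sorted(names) for mac, names in groups.items()}


def shared_macs(iface_to_mac):
    """Keep only the MACs that appear on more than one interface."""
    grouped = group_by_mac(iface_to_mac)
    return {mac: names for mac, names in grouped.items() if len(names) > 1}


# Hw-Address column from the log above (lo omitted: it reports no MAC).
observed = {
    "ens3": "00:13:97:0e:a9:93",
    "ens5": "00:13:97:44:d5:fd",
    "ens6": "00:13:97:44:d5:fd",
    "ens7": "00:13:97:87:a1:47",
    "ens8": "00:13:97:87:a1:47",
}

print(shared_macs(observed))
# {'00:13:97:44:d5:fd': ['ens5', 'ens6'], '00:13:97:87:a1:47': ['ens7', 'ens8']}
```

Each resulting group corresponds to one intended bond (ens5/ens6 and ens7/ens8), while ens3 remains a standalone interface.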

Then I disconnected the SR-IOV interfaces, rebooted the instance successfully, and ran sudo cloud-init collect-logs. See attached: issue-5794_collect-logs_cloud-init.tar.gz

Thank you :slightly_smiling_face:

jeremy-oracle commented 1 month ago

Also, it seems like adding mlx5_core to the tuple here stops the traceback, but doesn't stop cloud-init from configuring IPs on the base interfaces. In this case, those IPs should not be configured because the interfaces should be members of a bond. The bond carries the IP address, not its member interfaces. I am not sure whether cloud-init has a mechanism to designate certain interfaces to be bonded with a certain policy. As such, I wrote a script to do this, but I want to avoid conflicts with the existing cloud-init automation. :slightly_smiling_face:

jeremy-oracle commented 1 month ago

A patch like this might resolve this issue. I did a few reboot tests after doing sudo cloud-init clean --configs network --machine-id and it seemed to be working.

jeremy@jeremy-lx:~/dev/cloud-init$ git diff e10b09be321b81f82f1a2cb3b3724deedfefe9ff
diff --git a/cloudinit/net/__init__.py b/cloudinit/net/__init__.py
index 78b15a47b..dfd02f087 100644
--- a/cloudinit/net/__init__.py
+++ b/cloudinit/net/__init__.py
@@ -971,7 +971,7 @@ def get_interfaces_by_mac_on_linux() -> dict:
             # cloud-init happens to enumerate network interfaces before drivers
             # have fully initialized the leader/subordinate relationships for
             # those devices or switches.
-            if driver in ("fsl_enetc", "mscc_felix", "qmi_wwan"):
+            if driver in ("fsl_enetc", "mscc_felix", "qmi_wwan", "mlx5_core"):
                 LOG.debug(
                     "Ignoring duplicate macs from '%s' and '%s' due to "
                     "driver '%s'.",
diff --git a/tests/unittests/test_net.py b/tests/unittests/test_net.py
index 590061e03..9924a296e 100644
--- a/tests/unittests/test_net.py
+++ b/tests/unittests/test_net.py
@@ -5249,7 +5249,8 @@ class TestGetInterfacesByMac:
         assert expected == result

-@pytest.mark.parametrize("driver", ("mscc_felix", "fsl_enetc", "qmi_wwan"))
+@pytest.mark.parametrize("driver", ("mscc_felix", "fsl_enetc", "qmi_wwan",
+                                    "mlx5_core"))
 @mock.patch("cloudinit.net.get_sys_class_path")
 @mock.patch("cloudinit.util.system_info", return_value={"variant": "ubuntu"})
 class TestDuplicateMac:
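To make the effect of the one-line change easier to follow, here is a simplified stand-in for the duplicate check. It is not the actual body of `get_interfaces_by_mac_on_linux()`: the tuple-based input, the `IGNORED_DUP_DRIVERS` name, and the choice to keep the first interface seen are assumptions made for this sketch, and the real function may retain a different one of the two names.

```python
import logging

LOG = logging.getLogger(__name__)

# Drivers whose duplicate MACs are tolerated. "mlx5_core" is the addition
# proposed in the diff above; the constant name is hypothetical.
IGNORED_DUP_DRIVERS = ("fsl_enetc", "mscc_felix", "qmi_wwan", "mlx5_core")


def interfaces_by_mac(interfaces):
    """Build {mac: name} from an iterable of (name, mac, driver) tuples.

    A duplicate MAC raises RuntimeError unless the driver is listed in
    IGNORED_DUP_DRIVERS, in which case the later interface is skipped.
    """
    ret = {}
    for name, mac, driver in interfaces:
        if mac in ret:
            if driver in IGNORED_DUP_DRIVERS:
                LOG.debug(
                    "Ignoring duplicate macs from '%s' and '%s' due to "
                    "driver '%s'.",
                    name,
                    ret[mac],
                    driver,
                )
                continue
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'."
                % (name, ret[mac], mac)
            )
        ret[mac] = name
    return ret
```

With `mlx5_core` in the tuple, the pair ens5/ens6 no longer raises, but one of the two names still ends up in the returned mapping, so it remains visible to whatever consumes that mapping.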

I couldn't push my branch to origin, it seems like I am not allowed :slightly_smiling_face:

TheRealFalcon commented 1 month ago

> Also, it seems like adding mlx5_core to the tuple here stops the traceback, but doesn't stop cloud-init from configuring IPs on the base interfaces.

Yes, your patch essentially ignores one of the duplicates but configures the other, which is not ideal, as you mention.

We dealt with a similar issue on Azure, where 'mlx5_core' was initially ignored in a similar way, but it eventually evolved into https://github.com/canonical/cloud-init/pull/2153 (PR #2153). That solution doesn't work for you as-is because it targets a different hypervisor, but I'd think yours could look similar while using the driver name as surfaced in your cloud.
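As a rough illustration of what filtering by driver name could look like, the sketch below drops every interface that both uses a listed driver and shares its MAC with another interface, so that neither bond member is handed on for configuration. This is a hypothetical shape only: it does not reproduce PR #2153, and `drop_shared_mac_members` and `vf_drivers` are names invented for this example.

```python
from collections import Counter


def drop_shared_mac_members(interfaces, vf_drivers=("mlx5_core",)):
    """Filter (name, mac, driver) tuples, removing likely bond members.

    An interface is removed only when its driver is in vf_drivers AND its
    MAC appears on more than one interface; everything else is kept.
    """
    interfaces = list(interfaces)
    mac_counts = Counter(mac for _, mac, _ in interfaces)
    return [
        (name, mac, driver)
        for name, mac, driver in interfaces
        if not (driver in vf_drivers and mac_counts[mac] > 1)
    ]
```

A standalone interface with a unique MAC is left untouched by this filter even if it uses one of the listed drivers.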

> I couldn't push my branch to origin, it seems like I am not allowed

Correct. If you're looking to submit a PR, you need to fork the repo, push a branch to your remote, and then create a PR against the Canonical main branch.

jeremy-oracle commented 1 month ago

Thank you, I will have a look at #2153.

Also, is this bug still incomplete? I still see the incomplete label, and I couldn't find a way to remove it. My understanding is that the report is no longer missing any information :slightly_smiling_face:

TheRealFalcon commented 1 month ago

Sorry, removed the incomplete label.