canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Don't Break On Duplicate Mac Addresses #4043

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1996789

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2022-11-16T17:15:59.577005+00:00
date_fix_committed = None
date_fix_released = None
id = 1996789
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1996789
milestone = None
owner = holmanb
owner_name = Brett Holman
private = False
status = triaged
submitter = holmanb
submitter_name = Brett Holman
tags = ['sts']
duplicates = []

Launchpad user Brett Holman(holmanb) wrote on 2022-11-16T17:15:59.577005+00:00

Currently when duplicate mac addresses are detected, cloud-init dies.

While duplicate macs are typically corner cases, there are cases when they can be valid[1].

Consider this example[2]. After bonding two interfaces, the interfaces were left with duplicate mac addresses. Using cloud-init on this system fails at the time that these devices are detected.

If no network config is given, or if a config is given configuring a single address, we have the opportunity to do something intelligent to allow cloud-init to boot by using the "fallback interface" (in cloud-init this is the first interface), rather than throwing an exception and dying.

Netplan's mac matching assumes 1:1 mapping between mac addresses and interfaces, so in the case of multiple interfaces configured with matches, we still can't do anything intelligent.
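The fallback idea described above can be sketched as follows. This is only an illustration under stated assumptions, not cloud-init's actual code: `group_interfaces_by_mac` and `pick_fallback` are hypothetical names, and the input is assumed to be a list of `(name, mac)` pairs in detection order. Instead of raising on the first duplicate, the sketch records every interface per MAC and returns the first match when at most one address is configured.

```python
from collections import defaultdict


def group_interfaces_by_mac(interfaces):
    """Map each MAC to every interface name that carries it.

    `interfaces` is a list of (name, mac) pairs in detection order.
    Hypothetical helper, not part of cloud-init.
    """
    by_mac = defaultdict(list)
    for name, mac in interfaces:
        by_mac[mac.lower()].append(name)
    return dict(by_mac)


def pick_fallback(interfaces, configured_mac=None):
    """Return one interface name to configure, or None if nothing matches.

    Assumed policy: with no config, use the first detected interface;
    with a single configured MAC, use the first interface carrying it,
    even when other interfaces share that MAC.
    """
    interfaces = list(interfaces)
    if configured_mac is None:
        return interfaces[0][0] if interfaces else None
    names = group_interfaces_by_mac(interfaces).get(configured_mac.lower(), [])
    return names[0] if names else None
```

A config that matches several interfaces by MAC is deliberately out of scope here, since that case still needs a 1:1 mapping between MAC addresses and interfaces.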

[1] Until these have unique addresses, these interfaces will not be usable on the same broadcast domain, but they should still be able to work individually on different networks.

[2] https://stackoverflow.com/questions/74459180/deleted-bond-interface-left-me-with-duplicate-mac-on-two-interfaces

ubuntu-server-builder commented 1 year ago

Launchpad user Brett Holman(holmanb) wrote on 2022-11-16T17:15:59.577005+00:00

Launchpad attachments: failure on detection

ubuntu-server-builder commented 1 year ago

Launchpad user Trent Lloyd(lathiat) wrote on 2023-03-22T03:54:10.747950+00:00

I ran into this issue when doing SR-IOV Bonding on OpenStack. We can assign two VFs with the same MAC. An example of doing that is here: https://www.redpill-linpro.com/techblog/2021/01/30/bonding-sriov-nics-with-openstack.html

You can instead use unique MACs with fail-over-mac-policy=active, but then metadata and DHCP break when the slave interface is in use. So it's ideal to have duplicate MACs as an option.

We keep running into this in various scenarios and already have multiple workarounds:

OVS bridge duplicates: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1912844

Azure advanced networking: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1844191

Oracle net_failover: https://github.com/canonical/cloud-init/commit/fa47d527a03a00319936323f0a857fbecafceaf7

In most cases the real use case for this is some kind of VF-to-virtio failover for live migration or bonding (such is the case for both oracle net_failover and azure). Sometimes it's because a bridge, bond or OVS duplicates/steals a MAC - we also have special case code for handling that.

Currently when you hit this, cloud-init errors out and attempts no network configuration.

It would be ideal for cloud-init to make an attempt to configure the network with one of the interfaces - perhaps the one that already has the correct name or with some kind of priority that may have specifics for each driver type we already have exceptions for (ignore ovs/bridge/bond, prioritise the correct net_failover device, etc).
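One way to picture the priority idea is the sketch below. It is a hypothetical policy, not an existing cloud-init API: the `Candidate` type, the `kind` labels and `IGNORED_KINDS` are all assumptions made for illustration. Among interfaces sharing a MAC, it skips bond, bridge and OVS devices, prefers the interface that already has the expected name, and otherwise takes the first remaining one. Choosing the right net_failover device is not modelled.

```python
from dataclasses import dataclass

# Hypothetical labels for device kinds that should never be picked.
IGNORED_KINDS = frozenset({"bond", "bridge", "ovs"})


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str  # assumed label such as "physical", "bond", "bridge", "ovs"


def choose_among_duplicates(candidates, expected_name=None):
    """Pick one interface name from candidates that share a MAC.

    Returns None when every candidate is of an ignored kind.
    """
    usable = [c for c in candidates if c.kind not in IGNORED_KINDS]
    if not usable:
        return None
    for candidate in usable:
        if candidate.name == expected_name:
            return candidate.name
    return usable[0].name
```

Driver-specific rules could be added as further steps before the final fallback to the first usable interface.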

VertigoOne1 commented 1 year ago

Ran into this just now after deploying Kubernetes with the Calico CNI: the Calico interfaces had duplicate MACs after deploying the Helm chart and rebooting.

2023-08-16 07:13:34,446 - util.py[WARNING]: failed stage init
failed run of stage init

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 761, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 433, in main_init
    init.apply_network_config(bring_up=bring_up_interfaces)
  File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 909, in apply_network_config
    self.distro.networking.wait_for_physdevs(netcfg)
  File "/usr/lib/python3.9/site-packages/cloudinit/distros/networking.py", line 147, in wait_for_physdevs
    present_macs = self.get_interfaces_by_mac().keys()
  File "/usr/lib/python3.9/site-packages/cloudinit/distros/networking.py", line 74, in get_interfaces_by_mac
    return net.get_interfaces_by_mac(
  File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 870, in get_interfaces_by_mac
    return get_interfaces_by_mac_on_linux(
  File "/usr/lib/python3.9/site-packages/cloudinit/net/__init__.py", line 944, in get_interfaces_by_mac_on_linux
    raise RuntimeError(
RuntimeError: duplicate mac found! both 'cali9a68072be50' and 'calib855784d906' have mac 'ee:ee:ee:ee:ee:ee'

Version: /bin/cloud-init 22.1-10.el9_2.alma

This is the latest available Alma cloud image.

My thought would be to add some kind of configuration property that takes a regex whitelist or blacklist restricting which network interfaces cloud-init operates on. We often match on cali* to control certain behaviours, for example with iptables. This problem may materialise in a few situations with containerised workloads.
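A minimal sketch of such a filter, assuming it is applied to interface names before any duplicate-MAC check runs. The function `filter_interfaces` and its `allow` and `deny` parameters are hypothetical; cloud-init does not expose a setting like this as far as this thread shows.

```python
import re


def filter_interfaces(names, allow=None, deny=None):
    """Return the interface names cloud-init should consider.

    `allow` and `deny` are optional regex strings (hypothetical settings).
    A name is dropped if it fully matches `deny`, or if `allow` is given
    and the name does not fully match it.
    """
    allow_re = re.compile(allow) if allow else None
    deny_re = re.compile(deny) if deny else None
    kept = []
    for name in names:
        if deny_re and deny_re.fullmatch(name):
            continue
        if allow_re and not allow_re.fullmatch(name):
            continue
        kept.append(name)
    return kept
```

With the interfaces from the traceback above, `filter_interfaces(names, deny=r"cali.*")` would drop both `cali9a68072be50` and `calib855784d906`, so the shared `ee:ee:ee:ee:ee:ee` address would never reach the duplicate check.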