canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Networking failures after NIC reordering #3939

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1958280

Launchpad details
affected_projects = ['netplan']
assignee = None
assignee_name = None
date_closed = None
date_created = 2022-01-18T17:19:31.963574+00:00
date_fix_committed = None
date_fix_released = None
id = 1958280
importance = high
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1958280
milestone = None
owner = cjp256
owner_name = Chris Patterson
private = False
status = triaged
submitter = cjp256
submitter_name = Chris Patterson
tags = []
duplicates = []

Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:31.963574+00:00

We can reliably reproduce a case where a network configuration change for an Ubuntu 20.04 VM results in networkd hanging on "pending" interfaces. The interfaces are pending because the interface names from the current boot conflict with those found in /etc/netplan/50-cloud-init.yaml from the previous boot.

Specifically, the netplan generator applies the previous configuration's names before cloud-init local runs. We'll see something like:

systemd-udevd[228]: eth0: Failed to process device, ignoring: File exists
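To make the conflict concrete, here is a minimal sketch of how a stale config collides with the new enumeration order. It is not cloud-init or netplan code: `find_name_conflicts` is a hypothetical helper, the MAC addresses are made up, and the input dicts only mimic the `ethernets` section of a netplan v2 file and a name-to-MAC view of the running system.

```python
def find_name_conflicts(netplan_ethernets, current_interfaces):
    """Find renames requested by a stale config that hit a name already in use.

    netplan_ethernets: dict shaped like the `ethernets` section of a netplan
        v2 file (entries with `match.macaddress` and `set-name`).
    current_interfaces: dict of kernel interface name -> MAC for this boot.

    Returns a list of (mac, current_name, wanted_name) tuples.
    """
    # Reverse lookup: which name does each MAC hold right now?
    mac_to_current = {mac.lower(): name for name, mac in current_interfaces.items()}

    conflicts = []
    for cfg in netplan_ethernets.values():
        mac = cfg.get("match", {}).get("macaddress", "").lower()
        wanted = cfg.get("set-name")
        if not mac or not wanted:
            continue  # entry does not request a rename

        current = mac_to_current.get(mac)
        if current is None or current == wanted:
            continue  # NIC is absent, or already has the requested name

        # The rename collides if a different NIC currently holds the wanted name.
        holder_mac = current_interfaces.get(wanted)
        if holder_mac is not None and holder_mac.lower() != mac:
            conflicts.append((mac, current, wanted))
    return conflicts


if __name__ == "__main__":
    # Config written on the previous boot (illustrative MACs).
    stale = {
        "eth0": {"match": {"macaddress": "00:0d:3a:00:00:01"}, "set-name": "eth0"},
        "eth1": {"match": {"macaddress": "00:0d:3a:00:00:02"}, "set-name": "eth1"},
    }
    # Enumeration order after the NICs were swapped.
    now = {"eth0": "00:0d:3a:00:00:02", "eth1": "00:0d:3a:00:00:01"}
    for mac, current, wanted in find_name_conflicts(stale, now):
        print(f"{mac}: {current} -> {wanted} is blocked, name already taken")
```

With the two NICs swapped, each rename targets a name the other NIC currently holds, which is the situation in which udev reports "File exists".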

In one scenario, the data source is able to fetch updated network configuration, and cloud-init updates the config & udev rules just fine. However, networking stays offline ("pending") indefinitely. It can be forced to resolve by executing sudo udevadm trigger --attr-match=subsystem=net.

Example: Create a VM on Azure with two NICs, re-order them, then restart.

az vm create --name test-x1 --image Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest --nics test-nic-01 test-nic-02
az vm deallocate --name test-x1
az vm nics set --vm-name test-x1 --nics test-nic-02 test-nic-01
az vm start --name test-x1

After doing that, I am unable to log in via the serial console for 20 minutes, until cloud-init times out. In this case, Azure is trying to report ready but cannot because system networking never came up. If we remove /lib/systemd/system/cloud-init-local.service.d/50-azure-clear-persistent-obj-pkl.conf, cloud-init no longer hangs the boot, but networking still fails to initialize for the guest.

The behavior on 18.04 is a bit different. There, the renaming of the interfaces succeeds at early boot, which instead results in the Azure data source failing the local phase because the fallback_interface is no longer the primary NIC (the secondary NIC, eth1, was renamed to eth0 to match the previous boot's config).

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:31.963574+00:00

Launchpad attachments: Ubuntu 20.04 nic swap logs

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Patterson(cjp256) wrote on 2022-01-18T17:19:53.411383+00:00

Launchpad attachments: Ubuntu 18.04 nic swap logs

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2022-01-19T22:52:55.104750+00:00

Thanks for the thorough bug report. I have confirmed the 20.04 behavior and the root cause.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2022-01-21T17:22:08.679657+00:00

Adding netplan here, as cloud-init is generating the netplan config correctly before the network comes up.

cjp256 commented 1 year ago

@TheRealFalcon feel free to assign this to me

holmanb commented 4 months ago

@cjp256 any ideas on how to resolve this one? At first blush, this looks like a netplan issue. I'm not sure how we would solve this in cloud-init that wouldn't just be a workaround.

cjp256 commented 4 months ago

@holmanb The only option I think will work is dropping set-name usage for the Azure datasource in the netplan config. The NICs should be ordered fine during system enumeration (i.e. eth0 is primary until we force it to swap due to config).

My concern is the potential for side effects where set-name may be important. I don't know of such a situation, but it's hard to say there isn't one. Maybe we can add an option to the Azure datasource to toggle the behavior and change the default only for new distro versions to minimize risk?
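As a rough illustration of that toggle, the sketch below builds a netplan v2 style dict with and without set-name. It is not cloud-init's rendering code: `build_ethernets` and the `use_set_name` flag are hypothetical names, the MAC addresses are made up, and DHCP on every NIC is assumed for brevity.

```python
def build_ethernets(macs, use_set_name=True):
    """Build a netplan v2 style config dict for NICs listed in primary-first order.

    macs: list of MAC addresses, primary NIC first.
    use_set_name: hypothetical toggle; when False, each entry still matches on
        MAC but no rename is requested, so the kernel's own names are kept.
    """
    ethernets = {}
    for index, mac in enumerate(macs):
        entry_id = f"eth{index}"  # netplan ID; only a label when set-name is absent
        entry = {"dhcp4": True, "match": {"macaddress": mac.lower()}}
        if use_set_name:
            entry["set-name"] = entry_id  # asks for a rename at early boot
        ethernets[entry_id] = entry
    return {"network": {"version": 2, "ethernets": ethernets}}


if __name__ == "__main__":
    macs = ["00:0D:3A:00:00:02", "00:0D:3A:00:00:01"]  # illustrative values
    print(build_ethernets(macs, use_set_name=True))
    print(build_ethernets(macs, use_set_name=False))
```

With `use_set_name=False`, a config left over from the previous boot can still match each NIC by MAC, but it no longer requests a rename that could collide with the current enumeration order.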