canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init overriding set-name in netplan file #4074

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #2006106

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2023-02-06T08:32:36.504329+00:00
date_fix_committed = None
date_fix_released = None
id = 2006106
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/2006106
milestone = None
owner = eest
owner_name = Patrik Lundin
private = False
status = triaged
submitter = eest
submitter_name = Patrik Lundin
tags = []
duplicates = []

Launchpad user Patrik Lundin(eest) wrote on 2023-02-06T08:32:36.504329+00:00

After creating an Ubuntu 22.04 instance in OpenStack the following netplan file is generated:

# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        ens3:
            accept-ra: true
            dhcp4: true
            dhcp6: true
            match:
                macaddress: fa:16:3e:c7:f9:7e
            mtu: 1500
            set-name: ens3

With the matching links:

# ip -br l
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens3             UP             fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>

I was then trying to rename the interface from "ens3" to "eth0", updating the file like so:

# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        eth0:
            accept-ra: true
            dhcp4: true
            dhcp6: true
            match:
                macaddress: fa:16:3e:c7:f9:7e
            mtu: 1500
            set-name: eth0

Applying the config works, the interface is renamed without dropping my SSH connection:

# netplan apply

# ip -br l
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0             UP             fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>

So far so good, but now I reboot the machine, and it will not come back online:

# reboot
Connection to XXX.XXX.XXX.XXX closed by remote host.
Connection to XXX.XXX.XXX.XXX closed.

Logging in via a locally connected console I can see the following:

# ip -br l
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ens3             DOWN           fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST>

So for some reason the interface comes up as "ens3" again. It also has no address configuration assigned, which is why I cannot reach it. If I then run "netplan apply" manually, I can get it online again:

# netplan apply
# ip -br l
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0             UP             fa:16:3e:c7:f9:7e <BROADCAST,MULTICAST,UP,LOWER_UP>

Now logged in over SSH again, checking the dmesg log for renames shows the following:

# dmesg | grep rename
[    2.142770] virtio_net virtio0 ens3: renamed from eth0
[    6.089816] virtio_net virtio0 eth0: renamed from ens3
[    7.253661] virtio_net virtio0 ens3: renamed from eth0
[  278.607558] virtio_net virtio0 eth0: renamed from ens3

So the network name has been flapping back and forth between "ens3" and "eth0".

After digging around I think this is what happens:

[    2.142770] virtio_net virtio0 ens3: renamed from eth0 <- systemd-networkd, as part of initramfs
[    6.089816] virtio_net virtio0 eth0: renamed from ens3 <- systemd-networkd, as part of booted OS, using the files generated by my initial "netplan apply".
[    7.253661] virtio_net virtio0 ens3: renamed from eth0 <- cloud-init, for some reason
[  278.607558] virtio_net virtio0 eth0: renamed from ens3 <- my manual "netplan apply" after logging in to the console 
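The rename sequence above can also be pulled out of dmesg programmatically. The following is a minimal sketch, assuming the kernel lines keep the "driver device newname: renamed from oldname" shape shown in this report; the names `RENAME_RE` and `parse_renames` are made up for this illustration and are not part of any tool.

```python
import re

# Matches kernel lines such as:
#   [    2.142770] virtio_net virtio0 ens3: renamed from eth0
RENAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S+\s+\S+\s+(?P<new>\S+): renamed from (?P<old>\S+)"
)


def parse_renames(dmesg_text):
    """Return (timestamp, old_name, new_name) tuples in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = RENAME_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("old"), match.group("new"))
            )
    return events


# The four lines reported above, copied verbatim.
SAMPLE = """\
[    2.142770] virtio_net virtio0 ens3: renamed from eth0
[    6.089816] virtio_net virtio0 eth0: renamed from ens3
[    7.253661] virtio_net virtio0 ens3: renamed from eth0
[  278.607558] virtio_net virtio0 eth0: renamed from ens3
"""

events = parse_renames(SAMPLE)
print([new for _, _, new in events])  # ['ens3', 'eth0', 'ens3', 'eth0']
print(len(events))                    # 4
```

On a live system the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`. More than two renames for one device within a single boot is the pattern described here as flapping.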

Looking at /var/log/cloud-init.log the following message is seen:

2023-02-06 07:57:27,270 - __init__.py[DEBUG]: Detected interfaces {'eth0': {'downable': True, 'device_id': '0x0001', 'driver': 'virtio_net', 'mac': 'fa:16:3e:c7:f9:7e', 'name': 'eth0', 'up': False}, 'lo': {'downable': False, 'device_id': None, 'driver': None, 'mac': '00:00:00:00:00:00', 'name': 'lo', 'up': True}}
2023-02-06 07:57:27,270 - __init__.py[DEBUG]: achieving renaming of [['fa:16:3e:c7:f9:7e', 'ens3', None, None]] with ops [('rename', 'fa:16:3e:c7:f9:7e', 'ens3', ('eth0', 'ens3'))]
2023-02-06 07:57:27,270 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'eth0', 'name', 'ens3'] with allowed return codes [0] (shell=False, capture=True)
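To make the three log lines above easier to follow, here is a simplified illustration of how a rename operation could be derived from the detected interfaces and the desired MAC-to-name mapping. It is not cloud-init's actual implementation (which also handles bringing links down and up, name collisions and more); `plan_renames` and `to_ip_command` are hypothetical helpers written for this sketch.

```python
def plan_renames(detected, wanted):
    """Compute rename operations.

    detected: {current_name: mac_address} for interfaces present now.
    wanted:   [(mac_address, desired_name)] from the stored network config.
    Returns tuples shaped like the 'ops' entries in the log above:
        ('rename', mac, desired_name, (current_name, desired_name))
    """
    by_mac = {mac.lower(): name for name, mac in detected.items()}
    ops = []
    for mac, desired in wanted:
        current = by_mac.get(mac.lower())
        if current is None or current == desired:
            continue  # interface absent or already named as desired
        ops.append(("rename", mac, desired, (current, desired)))
    return ops


def to_ip_command(op):
    """Translate one rename op into the argv seen in the subp.py log line."""
    _, _, _, (current, desired) = op
    return ["ip", "link", "set", current, "name", desired]


# Values taken from the log lines above.
detected = {"eth0": "fa:16:3e:c7:f9:7e", "lo": "00:00:00:00:00:00"}
wanted = [("fa:16:3e:c7:f9:7e", "ens3")]

ops = plan_renames(detected, wanted)
print(ops)
# [('rename', 'fa:16:3e:c7:f9:7e', 'ens3', ('eth0', 'ens3'))]
print(to_ip_command(ops[0]))
# ['ip', 'link', 'set', 'eth0', 'name', 'ens3']
```

The point of the sketch is that the desired name comes only from the stored config ("ens3"), so the "eth0" name set by netplan is treated as the name to rename away from.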

I had a hard time understanding how cloud-init knew about the previous "ens3" name, but I now think it was persisted in obj.pkl during the first boot after install and is picked up from there on subsequent boots. From that same log:

2023-02-06 07:57:27,211 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)

Taking a look in the file:

# cat p.py
#!/usr/bin/env python3

import pickle

# open a file, where you stored the pickled data
with open('/var/lib/cloud/instance/obj.pkl', 'rb') as file:
    data = pickle.load(file)

print(data.network_config)
# ./p.py
{'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:c7:f9:7e', 'name': 'ens3'}, {'type': 'nameserver', 'address': 'XXX.XXX.XXX.XXX'}, {'type': 'nameserver', 'address': 'YYYY:YYYY:YYYY::YYYY:YYYY:YYYY'}]}
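As a small follow-up to the dump above, the MAC-to-name pairs that matter for renaming can be extracted from a version 1 network config like this. This is only a sketch for inspecting the cached data; `physical_names` is a hypothetical helper, and the dict is the one printed above (with the same redacted nameserver addresses).

```python
def physical_names(network_config):
    """Map MAC address -> configured name for 'physical' entries (v1 only)."""
    if network_config.get("version") != 1:
        raise ValueError("this sketch only understands version 1 configs")
    return {
        item["mac_address"].lower(): item["name"]
        for item in network_config.get("config", [])
        if item.get("type") == "physical"
        and "mac_address" in item
        and "name" in item
    }


# The network_config printed from obj.pkl above.
NETWORK_CONFIG = {
    "version": 1,
    "config": [
        {
            "mtu": 1500,
            "type": "physical",
            "accept-ra": True,
            "subnets": [{"type": "dhcp4"}, {"type": "dhcp6"}],
            "mac_address": "fa:16:3e:c7:f9:7e",
            "name": "ens3",
        },
        {"type": "nameserver", "address": "XXX.XXX.XXX.XXX"},
        {"type": "nameserver", "address": "YYYY:YYYY:YYYY::YYYY:YYYY:YYYY"},
    ],
}

print(physical_names(NETWORK_CONFIG))  # {'fa:16:3e:c7:f9:7e': 'ens3'}
```

So the cached object still says "ens3" for this MAC, regardless of what the edited netplan file says.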

From what I can tell this "name" is picked up in the openstack helper at https://github.com/canonical/cloud-init/blob/483f79cb3b94c8c7d176e748892a040c71132cb3/cloudinit/sources/helpers/openstack.py#L715

So... the question then is, how should this work? Right now it seems cloud-init performs a rename of its own, even though the netplan file asks for a different name than the one the machine had at initial install.

One thing that occurred to me is that maybe I am expected to feed cloud-init user-data so it knows from the start that I want the interface called "eth0". However, https://cloudinit.readthedocs.io/en/22.4.2/topics/network-config.html states "User-data cannot change an instance’s network configuration.", so it seems this is not the expected approach.

For now I guess the simplest workaround is to just disable the network management parts, as mentioned in the generated netplan file. This works:

# echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
# reboot

Now the machine comes up by itself, and there are fewer renames happening:

# dmesg | grep rename
[    2.165152] virtio_net virtio0 ens3: renamed from eth0
[    6.108291] virtio_net virtio0 eth0: renamed from ens3

It feels strange to have to disable the network management parts... What would be the correct way to deal with this situation?

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2023-02-10T05:40:34.121432+00:00

Thanks for filing this bug and helping make cloud-init better. Let's see if we can get to the root of the problem.

This may involve us asking you to run cloud-init collect-logs and attach the resulting tar file.

Please check instance-data-sensitive.json in that tar file before attaching it, because it could contain sensitive information if you provided passwords or user credentials in user-data on the affected VM.

Minimally I think we need to see the output of journalctl -b 0 -o short-precise and the full cloud-init.log (both of which are grabbed by cloud-init collect-logs anyway).

Generally, I don't think the OpenStack datasource default behavior should be for cloud-init to be actively rewriting or re-applying network config across reboot. It generally should be inert unless the datasource IMDS (instance metadata) either changes the instance-id in meta-data to a new UUID (telling cloud-init it needs to reconfigure the world) or if OpenStack was configured to re-render network per-boot.

So, we might have a bug that LinuxNetworking.apply_network_config_names is running more often than it should across normal system reboots even when the DataSourceOpenStack hasn't told cloud-init to re-render and re-apply new networking config due to BOOT_NEW_INSTANCE event.

I would have expected cloud-init to exit and do nothing with network renames across normal reboots due to these checks https://github.com/canonical/cloud-init/blob/main/cloudinit/stages.py#L905-L916
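The expectation described above can be written down as a tiny decision function. This is only a sketch of the intended behaviour as stated in this comment, not a copy of the checks in stages.py; the `Event` enum and `should_touch_network` are hypothetical stand-ins for this illustration.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical stand-ins for the boot events discussed above."""
    BOOT_NEW_INSTANCE = "boot-new-instance"  # instance-id changed or first boot
    BOOT = "boot"                            # ordinary reboot of a known instance


def should_touch_network(is_new_instance, allowed_events):
    """Return True only if the current boot event is one the datasource allows.

    allowed_events is the set of events for which the datasource permits
    network config to be rendered again (assumed default: new instance only).
    """
    event = Event.BOOT_NEW_INSTANCE if is_new_instance else Event.BOOT
    return event in allowed_events


default_policy = {Event.BOOT_NEW_INSTANCE}
per_boot_policy = {Event.BOOT_NEW_INSTANCE, Event.BOOT}

print(should_touch_network(True, default_policy))    # True
print(should_touch_network(False, default_policy))   # False
print(should_touch_network(False, per_boot_policy))  # True
```

Under this reading, an ordinary reboot with the default policy returns False, and both writing the netplan file and applying interface names would be skipped. The suspected bug is that name application is not behind that same gate.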

I think it will help to see the full cloud-init.log here to work out what really happened with all the PER_BOOT, PER_INSTANCE_REBOOT, datasource cache validation, instance-id and event checks, so we can better determine why cloud-init thinks it should be touching anything related to network renames across subsequent boots.

I'll set this to 'incomplete' status above, but please set it back to 'new' status when you get a chance to attach logs.

ubuntu-server-builder commented 1 year ago

Launchpad user Patrik Lundin(eest) wrote on 2023-02-20T16:56:35.981152+00:00

Attached is the result of running "cloud-init collect-logs" at the point where the netplan file has been modified to state "eth0", "netplan apply" has been run (replacing "ens3" with "eth0" at runtime), the machine has been rebooted and then ends up with an unconfigured "ens3" interface instead of the expected "eth0". Launchpad attachments: cloud-init.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2023-03-01T19:47:00.311114+00:00

Thank you very much for the logs, Patrik.

I can see logs indicating what you suggested: renames are applied on every reboot, regardless of whether the datasource and network config have been freshly detected and applied. We shouldn't see the "applying net names" logs when "No network config applied. Neither a new instance nor datasource network update allowed". I agree that this is undesirable behavior. cloud-init should remain inert on renames, because sysadmins could have gone in and changed the static /etc/netplan/*yaml to represent something other than cloud-init's original config.

That said, editing /etc/netplan/50-cloud-init.yaml is also a recipe for problems in the future if the instance-id that OpenStack presents to this node changes, i.e. the product_uuid for this VM as read from /sys/class/dmi/id/product_uuid. When that happens, cloud-init will recrawl the OpenStack IMDS endpoints at 169.254.169.254 and rewrite all network and system configuration, blowing away any changes to the 50-cloud-init.yaml file.

The logs confirm that net names are applied on the 2nd reboot in both the 'init-local' and 'init' stages, leading to name thrashing:

$ egrep -i 'applying net|Cloud-init v.|netplan' YOUR_LOGS
2023-02-20 16:19:42,270 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Mon, 20 Feb 2023 16:19:42 +0000. Up 6.51 seconds.
2023-02-20 16:19:43,780 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:43,786 - stages.py[INFO]: Applying network configuration from ds bringup=False: {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:43,788 - init.py[DEBUG]: Selected renderer 'netplan' from priority list: ['netplan', 'eni', 'sysconfig']
2023-02-20 16:19:43,791 - subp.py[DEBUG]: Running command ['netplan', 'info'] with allowed return codes [0] (shell=False, capture=True)
2023-02-20 16:19:43,974 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [644] 555 bytes
2023-02-20 16:19:43,975 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2023-02-20 16:19:46,359 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Mon, 20 Feb 2023 16:19:46 +0000. Up 10.60 seconds.
2023-02-20 16:19:46,540 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:19:51,905 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Mon, 20 Feb 2023 16:19:51 +0000. Up 16.09 seconds.
2023-02-20 16:19:53,116 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Mon, 20 Feb 2023 16:19:53 +0000. Up 17.31 seconds.
2023-02-20 16:19:53,331 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Mon, 20 Feb 2023 16:19:53 +0000. Datasource DataSourceOpenStackLocal [net,ver=2]. Up 17.58 seconds
2023-02-20 16:23:51,279 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Mon, 20 Feb 2023 16:23:51 +0000. Up 6.86 seconds.
2023-02-20 16:23:51,357 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:23:51,937 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Mon, 20 Feb 2023 16:23:51 +0000. Up 7.52 seconds.
2023-02-20 16:23:51,993 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'mtu': 1500, 'type': 'physical', 'accept-ra': True, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}], 'mac_address': 'fa:16:3e:19:ed:75', 'name': 'ens3'}, {'type': 'nameserver', 'address': '89.32.32.32'}, {'type': 'nameserver', 'address': '2001:6b0:89::32:32:32'}]}
2023-02-20 16:23:53,393 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Mon, 20 Feb 2023 16:23:53 +0000. Up 8.95 seconds.
2023-02-20 16:23:53,814 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Mon, 20 Feb 2023 16:23:53 +0000. Up 9.38 seconds.
2023-02-20 16:23:53,877 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Mon, 20 Feb 2023 16:23:53 +0000. Datasource DataSourceOpenStackLocal [net,ver=2]. Up 9.47 seconds
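For anyone wanting to repeat this check on their own cloud-init.log, the egrep above can be approximated in Python by counting "applying net config names" messages per stage run. This is a rough sketch, assuming the log wording shown in this report; `STAGE_RE` and `renames_by_stage` are names invented here, and the sample lines are the second boot from the output above with the config dict shortened to `{...}` for brevity.

```python
import re

# Matches the banner each stage prints when it starts, e.g.
#   ... Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at ...
STAGE_RE = re.compile(r"Cloud-init v\. \S+ running '(?P<stage>[^']+)'")


def renames_by_stage(lines):
    """Return [(stage, count)] in log order.

    count is the number of 'applying net config names' messages logged
    between that stage's banner and the next banner.
    """
    results = []
    for line in lines:
        match = STAGE_RE.search(line)
        if match:
            results.append([match.group("stage"), 0])
        elif "applying net config names" in line and results:
            results[-1][1] += 1
    return [tuple(entry) for entry in results]


# Second boot from the log above; the dict payload is abbreviated as {...}.
SAMPLE = [
    "2023-02-20 16:23:51,279 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Mon, 20 Feb 2023 16:23:51 +0000. Up 6.86 seconds.",
    "2023-02-20 16:23:51,357 - stages.py[DEBUG]: applying net config names for {...}",
    "2023-02-20 16:23:51,937 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Mon, 20 Feb 2023 16:23:51 +0000. Up 7.52 seconds.",
    "2023-02-20 16:23:51,993 - stages.py[DEBUG]: applying net config names for {...}",
    "2023-02-20 16:23:53,393 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Mon, 20 Feb 2023 16:23:53 +0000. Up 8.95 seconds.",
    "2023-02-20 16:23:53,814 - util.py[DEBUG]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Mon, 20 Feb 2023 16:23:53 +0000. Up 9.38 seconds.",
]

print(renames_by_stage(SAMPLE))
# [('init-local', 1), ('init', 1), ('modules:config', 0), ('modules:final', 0)]
```

To run it against a real log, pass `open("/var/log/cloud-init.log")` instead of `SAMPLE`. A non-zero count in 'init-local' or 'init' on a boot that is not a new instance matches the behaviour reported in this bug.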

ubuntu-server-builder commented 1 year ago

Launchpad user Patrik Lundin(eest) wrote on 2023-03-02T09:41:10.931703+00:00

Hello,

Thanks for the follow-up. If editing "50-cloud-init.yaml" is not a good idea, what is the appropriate way to deal with a rename then? Even if we remove "50-cloud-init.yaml" and create another file, if the "instance-id" changes (is this likely?) will this not just result in "50-cloud-init.yaml" being recreated, now leading to having two conflicting netplan files instead?

If the correct thing is "feed cloud-init this information from the start" then I am not sure how to properly do that: as I stated initially it is documented that you are not allowed to configure network stuff via user-data (https://cloudinit.readthedocs.io/en/22.4.2/topics/network-config.html)