netplan configuration not re-generated

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1847583

Launchpad details

affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2019-10-10T09:40:46.736880+00:00
date_fix_committed = None
date_fix_released = None
id = 1847583
importance = undecided
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1847583
milestone = None
owner = deshke
owner_name = Alexander Weber
private = False
status = triaged
submitter = deshke
submitter_name = Alexander Weber
tags = []
duplicates = []

Launchpad user Alexander Weber(deshke) wrote on 2019-10-10T09:40:46.736880+00:00

via https://bugs.launchpad.net/cloud-init/+bug/1846535/comments/40

steps to reproduce: (ec2/gcp/azure all have the same issue)

start an Ubuntu 18.04 instance
update cloud-init to 19.2-36-g059d049c-0ubuntu2~18.04.1

cloud-init-output.log

Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'init-local' at Thu, 10 Oct 2019 08:33:00 +0000. Up 11.53 seconds.
Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'init' at Thu, 10 Oct 2019 08:33:03 +0000. Up 14.77 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |          Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: |  ens5  | True |       172.31.18.122        | 255.255.240.0 | global | 02:a8:af:c9:34:90 |
ci-info: |  ens5  | True | fe80::a8:afff:fec9:3490/64 |       .       |  link  | 02:a8:af:c9:34:90 |
ci-info: |   lo   | True |         127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |          ::1/128           |       .       |  host  |         .         |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: | Route | Destination |   Gateway   |     Genmask     | Interface | Flags |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 172.31.16.1 |     0.0.0.0     |    ens5   |   UG  |
ci-info: |   1   | 172.31.16.0 |   0.0.0.0   |  255.255.240.0  |    ens5   |   U   |
ci-info: |   2   | 172.31.16.1 |   0.0.0.0   | 255.255.255.255 |    ens5   |   UH  |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    ens5   |   U   |
ci-info: |   3   |    local    |    ::   |    ens5   |   U   |
ci-info: |   4   |   ff00::/8  |    ::   |    ens5   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
Generating public/private rsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_rsa_key.

root@ip-172-31-18-122:~# networkctl
IDX LINK             TYPE               OPERATIONAL SETUP
  1 lo               loopback           carrier     unmanaged
  2 ens5             ether              routable    configured

2 links listed.

root@ip-172-31-18-122:~# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource.  Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        ens5:
            dhcp4: true
            dhcp6: true
            match:
                macaddress: 02:a8:af:c9:34:90
            set-name: ens5

update packages and install kernel 5.3.5
create an image of the now-running instance
start an instance from the image
unable to connect
stop the instance
mount the disk to another running instance

After mounting the disk on another instance, one can see that the netplan configuration was not updated: it still matches the original instance's MAC address (02:a8:af:c9:34:90), while ens5 on the new instance reports 06:c1:88:5c:97:58.

Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'modules:config' at Thu, 10 Oct 2019 08:53:35 +0000. Up 24.40 seconds.
Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 running 'modules:final' at Thu, 10 Oct 2019 08:53:36 +0000. Up 25.32 seconds.
Cloud-init v. 19.2-36-g059d049c-0ubuntu1~18.04.1 finished at Thu, 10 Oct 2019 08:53:36 +0000. Datasource DataSourceEc2Local.  Up 25.43 seconds
Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 running 'init-local' at Thu, 10 Oct 2019 09:15:51 +0000. Up 16.36 seconds.
Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 running 'init' at Thu, 10 Oct 2019 09:15:52 +0000. Up 17.47 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: |  ens5  | False |     .     |     .     |   .   | 06:c1:88:5c:97:58 |
ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+

root@ip-172-31-40-235:/mnt# cat etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource.  Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        ens5:
            dhcp4: true
            dhcp6: true
            match:
                macaddress: 02:a8:af:c9:34:90
            set-name: ens5
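
The mismatch can also be checked mechanically. The following is a minimal Python sketch, not part of cloud-init: it extracts every unquoted `macaddress:` value from a netplan file with a regular expression (an assumption made to avoid a YAML dependency) and reports those that no attached NIC provides. The function names are hypothetical.

```python
import re

# Matches unquoted lines such as "macaddress: 02:a8:af:c9:34:90".
MAC_LINE = re.compile(r"^\s*macaddress:\s*([0-9A-Fa-f:]{17})\s*$", re.MULTILINE)


def netplan_macs(netplan_text):
    """Return the set of MAC addresses pinned in the file (lowercased)."""
    return {mac.lower() for mac in MAC_LINE.findall(netplan_text)}


def stale_macs(netplan_text, present_macs):
    """Return pinned MACs that none of the attached NICs provides."""
    present = {mac.lower() for mac in present_macs}
    return netplan_macs(netplan_text) - present


NETPLAN = """\
network:
    version: 2
    ethernets:
        ens5:
            dhcp4: true
            dhcp6: true
            match:
                macaddress: 02:a8:af:c9:34:90
            set-name: ens5
"""

# ens5 on the new instance reports 06:c1:88:5c:97:58, so the pinned MAC is stale.
print(stale_macs(NETPLAN, ["06:c1:88:5c:97:58"]))  # → {'02:a8:af:c9:34:90'}
```

An empty result would mean every pinned MAC is still present on the instance.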

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-10T15:45:18.653019+00:00

Hi Alexander,

Thanks for the bug! Can you collect /var/log/cloud-init.log and /var/log/cloud-init-output.log from an affected instance, attach them to the bug and set it back to New, please?

Thanks!

Dan

ubuntu-server-builder commented 1 year ago

Launchpad user Richard Maynard(richard-maynard) wrote on 2019-10-10T19:16:58.817743+00:00

This is the same problem we're having that I referenced in a different bug.

We had the same problem in both GCP and AWS. One of my early attempts to work around the issue was to use apt to hold the cloud-init package at an older version. When that failed, I thought our problem was not cloud-init related, but it turns out that holding back the version made no difference: cloud-init was still using the latest release when handling the networking.

I've attached our cloud-init.log.

We ended up implementing the workaround from a different bug report: disabling cloud-init's network configuration and writing our own netplan configuration. This allowed us to continue with new image builds.
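
A minimal Python sketch of that workaround, under stated assumptions: the disable file's path and content are taken from the header comment of the generated 50-cloud-init.yaml quoted earlier in this report, while the netplan file name and its DHCP-only content are hypothetical, since the configuration actually used was not posted.

```python
import os
import tempfile

# Path and content as given in the header of the generated netplan file.
DISABLE_CFG = "network: {config: disabled}\n"

# Assumption: a DHCP-only config keyed by interface name, with no MAC match.
OWN_NETPLAN = """\
network:
    version: 2
    ethernets:
        ens5:
            dhcp4: true
"""


def write_workaround(root="/"):
    """Write both files under root and return the paths that were written."""
    targets = {
        "etc/cloud/cloud.cfg.d/99-disable-network-config.cfg": DISABLE_CFG,
        "etc/netplan/60-own-config.yaml": OWN_NETPLAN,  # hypothetical file name
    }
    written = []
    for rel_path, content in targets.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write(content)
        written.append(path)
    return written


if __name__ == "__main__":
    # Dry run into a scratch directory instead of the real filesystem.
    with tempfile.TemporaryDirectory() as scratch:
        for path in write_workaround(scratch):
            print(path)
```

Because netplan merges definitions across files, a previously generated 50-cloud-init.yaml with a stale MAC match would presumably still need to be removed or overwritten in the image.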

Our image build pipeline is driven by CI, and generally does this:

packer creates image based on a specific ubuntu-bionic image
packer uses a combination of exec + salt provisioners
packer copies the image to all regions and accounts where we deploy
terraform is run updating launch templates to use the new AMI + apply user data
other tooling cycles nodes through the autoscale group causing them to upgrade

I don't know if the fact that we use packer, build the image, then build again from that image is relevant, but it seems like it might be related to why the packer images using these AMIs with the default cloud-init work, but the images based off of these do not.

Launchpad attachments: cloud-init.log

ubuntu-server-builder commented 1 year ago

Launchpad user Alexander Weber(deshke) wrote on 2019-10-11T13:52:22.548848+00:00

As requested, here is one cloud-init-output.log.

Launchpad attachments: cloud-init-output.log

ubuntu-server-builder commented 1 year ago

Launchpad user Alexander Weber(deshke) wrote on 2019-10-11T13:52:50.553006+00:00

And one cloud-init.log.

Launchpad attachments: cloud-init.log

ubuntu-server-builder commented 1 year ago

Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-02-28T10:08:06.917673+00:00

Even though this issue is rather old and appears stalled, I refrained from raising another bug and would love to revive the discussion here.

I just ran into exactly this issue of old netplan data ruining a simple reconfiguration when moving an OpenStack instance (VM) into another network (by attaching different network ports with different MAC addresses).

When googling the error string of the failed MAC matching, others seem to still run into this as well: https://stackoverflow.com/questions/70671180/openstack-cloud-init-do-not-assign-proper-network-interface-to-instance

This certainly is in no way a weird cloning process of an existing instance, but changing or adding network interfaces is just part of the regular lifecycle in IaaS cloud environments.

I'd like to strongly argue that cloud-init should simply refresh / render the network config with the current metadata on every boot. This is the sole purpose of having dynamic metadata in cloud environments in the first place so that a reconfiguration will not require explicit config management on the instance.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2023-03-01T22:48:05.432841+00:00

Christian, is your issue related to the versions listed in this bug? On most clouds if you create a new instance from an AMI or move an instance, the new instance will come up fine. The cloud will issue a new instance ID causing cloud-init to fetch and apply network data.

It sounds like your OpenStack installation isn't providing you a new instance ID. If that is common or default OpenStack functionality, I could see us moving OpenStack to fetch network data every boot, but this adds significant time to boot and is not something we want across cloud-init.
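
The behavior described here can be modeled in a few lines. This is a simplified Python sketch of the decision, not cloud-init's actual implementation; the function name, its parameters, and the instance IDs are hypothetical.

```python
def should_apply_network(cached_instance_id, current_instance_id, update_on_boot=False):
    """Return True when the network config should be fetched and re-rendered."""
    if cached_instance_id != current_instance_id:
        # The cloud issued a new instance ID: treat this as a first boot.
        return True
    # Same instance: only re-render if per-boot updates were enabled.
    return update_on_boot


# New instance launched from an image: the ID differs, config is re-rendered.
print(should_apply_network("i-aaaa", "i-bbbb"))

# Port swapped on an existing instance: same ID, stale config is kept.
print(should_apply_network("i-aaaa", "i-aaaa"))

# Same instance with per-boot updates enabled: config is re-rendered.
print(should_apply_network("i-aaaa", "i-aaaa", update_on_boot=True))
```

The second case is the one reported for OpenStack port changes: nothing signals that the cached network config is out of date.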

I think that this is all unrelated to this bug though, so I think it would be better discussed on a separate bug (or IRC or discourse).

ubuntu-server-builder commented 1 year ago

Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-03-02T15:52:36.753281+00:00

Thanks James for the quick response.

OpenStack does not issue a new instance ID if you "just" change around some network interfaces of an instance. In OpenStack you can create ports independently from the instance and then attach those ports.

So the case I described is just about detaching a network port and then attaching another one (e.g. in another virtual network / subnet of an OpenStack cloud). The netplan config generated on the first run contains a match on the MAC address, and the new port has a different MAC, so the network config is dysfunctional from then on.

And if the user already "deleted" the old port (e.g. because something like terraform was used), the old MAC address is also gone, so the instance ends up being unreachable and unrecoverable (if no root password was set to use the console or if the cloud does not provide "rescue" boots).

I know changing network ports is not really common, but is fetching the network config on each boot really that expensive / time-consuming?

If you want me to open another bug, I will gladly do so. I simply did not want to add to the long list of bugs for no reason.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2023-03-02T16:26:59.589707+00:00

Thanks for the context. We'll consider changing OpenStack to fetch networking per boot if that's the default OpenStack behavior. In the meantime, you can change it yourself by taking this example and putting it in your user data, /etc/cloud/cloud.cfg, or /etc/cloud/cloud.cfg.d/: https://cloudinit.readthedocs.io/en/latest/explanation/events.html#example
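
For images built by script, that setting can be dropped into /etc/cloud/cloud.cfg.d/ programmatically. A minimal Python sketch, with two assumptions: the `updates` stanza below is believed to match the example on the linked page and should be verified there, and the file name is hypothetical.

```python
import os
import tempfile

# Assumption: mirrors the example on the linked events page; verify before use.
PER_BOOT_CFG = """\
updates:
  network:
    when: ['boot']
"""


def write_per_boot_cfg(cfg_dir="/etc/cloud/cloud.cfg.d",
                       name="99-network-per-boot.cfg"):  # hypothetical file name
    """Write the per-boot network update stanza and return the file path."""
    os.makedirs(cfg_dir, exist_ok=True)
    path = os.path.join(cfg_dir, name)
    with open(path, "w") as handle:
        handle.write(PER_BOOT_CFG)
    return path


if __name__ == "__main__":
    # Dry run into a scratch directory instead of the real config directory.
    with tempfile.TemporaryDirectory() as scratch:
        print(write_per_boot_cfg(cfg_dir=scratch))
```

When the stanza is supplied as user data instead, it would additionally need the usual `#cloud-config` first line.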

> I know changing network ports is not really common, but is fetching the network config on each boot really that expensive / time-consuming?

To a degree, yes. Since it involves crawling multiple endpoints, it can take several seconds (I've seen up to 10) depending on the cloud. Some clouds care very much about boot time and don't want cloud-init spending that time when we don't need to.

> If you want me to open another bug, I will gladly do so. I simply did not want to add to the long list of bugs for no reason.

If there's nothing else to discuss, then we can leave this as-is. I just didn't want to hijack this bug for a related but different issue.

ubuntu-server-builder commented 1 year ago

Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-03-09T17:25:07.692424+00:00

Thanks again for picking up on this issue. If in the end you do not change the default to "always" update the network config (which I still believe makes sense), at least adding some words to the documentation on how to enable this "manually" would be great.

This likely affects not only OpenStack: other clouds also allow interfaces to be "moved" around (detached / re-attached), which will ultimately change MAC addresses or interface order. In the current state of things a reboot will not fix this without the (thanks!) referenced config change.

frittentheke commented 5 months ago

> Launchpad user James Falcon(falcojr) wrote on 2023-03-02T16:26:59.589707+00:00

> Thanks for the context. We'll consider changing Openstack to fetch networking per boot if that's the default Openstack behavior.

@TheRealFalcon is this issue still being tracked and are you still planning on changing the default at some point? While manually forcing the network to be updated on every boot (https://cloudinit.readthedocs.io/en/latest/explanation/events.html#example) works, OpenStack Nova does not (yet) allow for user_data to be mutated, see https://review.opendev.org/q/topic:%22bp/update-userdata%22 about the efforts to implement this. Since the network configuration is influenced by adding or removing network ports, the only way out currently is to use the console to manually log in and adjust the netplan config, or to rebuild the VM.

holmanb commented 3 months ago

I wonder if an event-driven (socket-activated service) doorbell telling cloud-init to re-crawl network configuration would be a better bet in the long term. This could be generic for use across multiple clouds, and with something like a unix socket to activate on, various container and virtual machine technologies would be able to support this.

This is really just a slightly different use case of the current hotplug code.

TheRealFalcon commented 3 months ago

> This is really just a slightly different use case of the current hotplug code.

How do you see it being at all different? Hotplug still doesn't handle when the machine is powered off, so I think per-boot is still the way forward. Yes, we should change the default.

holmanb commented 3 months ago

>> This is really just a slightly different use case of the current hotplug code.

> How do you see it being at all different? Hotplug still doesn't handle when the machine is powered off, so I think per-boot is still the way forward. Yes, we should change the default.

I imagine a hypervisor-triggered event for anytime the network configuration changes.

The upsides of this design over per-boot are:

performance: would not unnecessarily slow boot time
flexibility: run anytime, not just during startup
flexibility: not dependent on device hotplug
feedback loop: cloud-init could "close the loop" to tell the cloud that it has configured network
elegance: per-boot is clunky, wouldn't require cloud-specific code in cloud-init, it could "just work" the same way for every cloud that chooses to support it (with a dedicated unix socket path)

The big and obvious downsides are:

requires clouds to opt-in
might not work on every hypervisor platform (does vmware support host-guest unix sockets?)
unclear how event-driven services might work on non-systemd (inetd perhaps?)
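
The doorbell idea can be illustrated with a plain unix socket. This is a hypothetical Python sketch, not an existing cloud-init interface: the socket path, the message, and all function names are assumptions, and the listener binds the socket itself, whereas a socket-activated service would be handed an already-listening socket by the service manager.

```python
import os
import socket
import tempfile


def open_doorbell(sock_path):
    """Guest side: create a listening unix socket at sock_path."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    return server


def ring(sock_path, message=b"network-changed"):
    """Host side: connect, send one message, and hang up."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(message)


def wait_for_ring(server, on_ring):
    """Guest side: block for one connection and pass its payload to on_ring."""
    conn, _ = server.accept()
    with conn:
        payload = conn.recv(1024)
    return on_ring(payload)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "doorbell.sock")  # hypothetical path
        server = open_doorbell(path)
        try:
            ring(path)  # waits in the listen backlog until accept() runs
            print(wait_for_ring(server, lambda p: "re-crawl requested: " + p.decode()))
        finally:
            server.close()
```

In a real design the callback would trigger a re-crawl of the network metadata, and a reply on the same connection could provide the feedback loop listed among the upsides.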

TheRealFalcon commented 3 months ago

> The upsides of this design over per-boot

We still need both though. When a NIC is added while the machine is powered off, we need to know of its existence when we power it back on.

> I imagine a hypervisor-triggered event for anytime the network configuration changes.

We can always ask, but I'd be shocked if we were to get anything like this. I realize that udev mechanics are a little clunkier, but what advantage to the end user do you see to this over using udev?

frittentheke commented 3 months ago

@TheRealFalcon @holmanb whatever you end up doing, I am really grateful you are discussing this matter of dynamic interface / network configuration!

holmanb commented 3 months ago

>> The upsides of this design over per-boot

> We still need both though. When a NIC is added while the machine is powered off, we need to know of its existence when we power it back on.

Sure, we support a huge number of clouds, of course we wouldn't convert overnight.

>> I imagine a hypervisor-triggered event for anytime the network configuration changes.

> We can always ask, but I'd be shocked if we were to get anything like this.

That's the beauty of an event driven cloud-agnostic design. It doesn't even depend on them building anything. We can build it, document a specification which is "the way to do this with cloud-init". Cloud-init won't waste resources unless it is implemented by the cloud. Then we could implement it in (or advertise it to the devs of) the open source clouds: (LXD / MAAS / OpenStack / CloudStack / OpenNebula / etc), and then allow the clouds to make use of it if they see the value.

My understanding is that new custom cloud infrastructure features are very expensive because of the pressure that clouds feel to maintain things forever once they support it. I think that this is different because all they would have to do is wire up an event based on image type, which I would expect to be a lot less expensive than a service.

> I realize that udev mechanics are a little clunkier, but what advantage to the end user do you see to this over using udev?

Udev is only relevant to interface changes currently. There are plenty of times that a user or cloud might want to modify the network configuration in a way that doesn't involve interface changes. One might want to add a public IP, update routes, etc, etc. These changes are often cloud-driven through a higher level abstraction that could trickle down to running instances much more cleanly if something like this were possible.

Imagine a point-and-click user goes to their dashboard, adds a public IP that they own to a specific instance, they see a little spinning circle for 5 seconds while the backend pokes cloud-init and waits for a response, and then they see a completion event signaling that their new IP has been configured on the instance.

Cloud-init's current story for picking up new network configurations is what, reboot your instance?

sdwru commented 2 weeks ago

Not sure if this is related, but when I had this problem I found that it was because my default cloud-init network renderer was eni instead of netplan. That was because I had ifupdown installed.

apt purge ifupdown solved it for me.
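
Why installing ifupdown could change the renderer can be sketched as a first-match selection over a priority list. This Python model is an illustration only, not cloud-init's code: it assumes eni is tried before netplan, and it uses the presence of the `ifup` and `netplan` binaries as a stand-in for whatever availability checks cloud-init really performs.

```python
import shutil

# Assumption: eni is tried before netplan, as the observation above suggests.
ASSUMED_PRIORITY = ["eni", "netplan"]

# Stand-in availability probes: one binary per renderer.
PROBE_BINARY = {"eni": "ifup", "netplan": "netplan"}


def pick_renderer(priority, is_available):
    """Return the first renderer in priority order that is available, or None."""
    for name in priority:
        if is_available(name):
            return name
    return None


def binary_present(name):
    """Report whether the probe binary for a renderer is on PATH."""
    return shutil.which(PROBE_BINARY[name]) is not None


if __name__ == "__main__":
    # The result depends on which packages are installed on this machine.
    print(pick_renderer(ASSUMED_PRIORITY, binary_present))
```

Under this model, purging ifupdown removes the eni candidate, so selection falls through to netplan.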

canonical / cloud-init

netplan configuration not re-generated #3463