Open ubuntu-server-builder opened 1 year ago
Launchpad user Dan Watkins(oddbloke) wrote on 2019-10-10T15:45:18.653019+00:00
Hi Alexander,
Thanks for the bug! Can you collect /var/log/cloud-init.log and /var/log/cloud-init-output.log from an affected instance, attach them to the bug and set it back to New, please?
Thanks!
Dan
Launchpad user Richard Maynard(richard-maynard) wrote on 2019-10-10T19:16:58.817743+00:00
This is the same problem we're having that I referenced in a different bug.
We had the same problem in both GCP and AWS. One of my early attempts to work around the issue was to hold the cloud-init package at an older version with apt. When that failed I thought our problem was not cloud-init related, but it turns out that holding back the version makes no difference: cloud-init was still using the latest release when handling the networking.
I've attached our cloud-init.log.
We ended up implementing the workaround from a different bug report: disabling cloud-init's networking configuration and writing our own netplan configuration. This allowed us to continue with new image builds.
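For readers looking for the mechanics of that workaround, here is a minimal sketch. It assumes the workaround is the commonly documented drop-in that switches off cloud-init's network rendering; the comment above does not say exactly which file was used, and the function name is made up for illustration. You still have to supply your own netplan configuration afterwards.

```shell
#!/bin/sh
# Sketch only: disable cloud-init's network configuration via a drop-in file.
# disable_cloud_init_networking is a hypothetical helper name.
# The optional first argument overrides the target directory (useful for testing).
disable_cloud_init_networking() {
    cfg_dir="${1:-/etc/cloud/cloud.cfg.d}"
    cat > "${cfg_dir}/99-disable-network-config.cfg" <<'EOF'
network: {config: disabled}
EOF
}

# Usage (as root on the image being built):
#   disable_cloud_init_networking
```

With this in place cloud-init should leave networking alone, so a hand-written file under /etc/netplan/ becomes the only source of network configuration.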
Our image build pipeline is driven by CI, and generally does this:
I don't know if the fact that we use packer, build the image, then build again from that image is relevant, but it seems like it might be related to why the packer images using these AMIs with the default cloud-init work, but the images based off of these do not.
Launchpad attachments: cloud-init.log
Launchpad user Alexander Weber(deshke) wrote on 2019-10-11T13:52:22.548848+00:00
As requested, one cloud-init-output.log. Launchpad attachments: cloud-init-output.log
Launchpad user Alexander Weber(deshke) wrote on 2019-10-11T13:52:50.553006+00:00
and one cloud-init.log Launchpad attachments: cloud-init.log
Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-02-28T10:08:06.917673+00:00
Even though this issue is rather old and appears stalled, I refrained from raising another bug, but would love to revive the discussion ....
I just ran into exactly this issue of old netplan data ruining a simple reconfiguration when moving an OpenStack instance (VM) into another network (by attaching different network ports with different MAC addresses).
When googling the error string of the failed MAC matching, others seem to still run into this as well: https://stackoverflow.com/questions/70671180/openstack-cloud-init-do-not-assign-proper-network-interface-to-instance
This certainly is in no way a weird cloning process of an existing instance, but changing or adding network interfaces is just part of the regular lifecycle in IaaS cloud environments.
I'd like to strongly argue that cloud-init should simply refresh / render the network config with the current metadata on every boot. This is the sole purpose of having dynamic metadata in cloud environments in the first place so that a reconfiguration will not require explicit config management on the instance.
Launchpad user James Falcon(falcojr) wrote on 2023-03-01T22:48:05.432841+00:00
Christian, is your issue related to the versions listed in this bug? On most clouds if you create a new instance from an AMI or move an instance, the new instance will come up fine. The cloud will issue a new instance ID causing cloud-init to fetch and apply network data.
It sounds like your OpenStack installation isn't providing you a new instance ID. If that is common or default OpenStack functionality, I could see us moving OpenStack to fetch network data every boot, but this adds some significant time to boot and is not something we want across cloud-init.
I think that this is all unrelated to this bug though, so I think it would be better discussed on a separate bug (or IRC or discourse).
Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-03-02T15:52:36.753281+00:00
Thanks James for the quick response.
OpenStack does not issue a new instance ID if you "just" change around some network interfaces of an instance. In OpenStack you can create ports independently from the instance and then attach those ports.
So the case I described is just about detaching a network port and then attaching another one (e.g. in another virtual network / subnet of an OpenStack cloud). Since the netplan config generated on the first run contains a match based on the MAC address, and the new port has a new one, the network config will be dysfunctional from then on.
And if the user already "deleted" the old port (e.g. because something like terraform was used), the old MAC address is also gone, so the instance ends up being unreachable and unrecoverable (if no root password was set to use the console or if the cloud does not provide "rescue" boots).
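To check whether an instance is in this state, one can compare the MAC addresses that the rendered netplan file matches on against the MAC addresses of the interfaces actually present. The sketch below assumes the rendered file is /etc/netplan/50-cloud-init.yaml (cloud-init's usual output path) and that interface addresses are readable under /sys/class/net; the function name is invented for illustration.

```shell
#!/bin/sh
# Sketch only: print every MAC address that the netplan file matches on
# but that no currently present interface carries.
# stale_netplan_macs is a hypothetical helper name.
# Arguments (both optional): netplan file, sysfs network directory.
stale_netplan_macs() {
    netplan_file="${1:-/etc/netplan/50-cloud-init.yaml}"
    sys_net="${2:-/sys/class/net}"
    grep -i 'macaddress:' "$netplan_file" \
        | sed -e 's/.*macaddress:[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | tr -d "\"'" \
        | tr 'A-F' 'a-f' \
        | sort -u \
        | while read -r mac; do
            # sysfs reports addresses in lower case, one per interface.
            if ! cat "$sys_net"/*/address 2>/dev/null | grep -qxF "$mac"; then
                echo "$mac"
            fi
        done
}

# Usage, e.g. from a console or rescue boot:
#   stale_netplan_macs
```

Any address it prints belongs to a port that is no longer attached, which is the situation described above: netplan keeps matching on a MAC that no longer exists.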
I know changing network ports is not really common, but is fetching the network config on each boot really that expensive / time-consuming?
If you want me to open another bug, I'll gladly do so. I simply did not want to add to the long list of bugs for no reason.
Launchpad user James Falcon(falcojr) wrote on 2023-03-02T16:26:59.589707+00:00
Thanks for the context. We'll consider changing OpenStack to fetch networking per boot if that's the default OpenStack behavior. In the meantime, you can change it yourself by taking this example and putting it in your user data, /etc/cloud/cloud.cfg, or /etc/cloud/cloud.cfg.d/: https://cloudinit.readthedocs.io/en/latest/explanation/events.html#example
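As a rough illustration of that suggestion, the following sketch writes a drop-in asking cloud-init to re-apply network configuration on every boot. The `updates` / `network` / `when` keys follow the linked events documentation; the file name and function name are hypothetical, and whether the event is honoured depends on the datasource in use.

```shell
#!/bin/sh
# Sketch only: ask cloud-init to re-render network config on every boot.
# write_network_update_policy and the drop-in file name are hypothetical.
# The optional first argument overrides the target directory (useful for testing).
write_network_update_policy() {
    cfg_dir="${1:-/etc/cloud/cloud.cfg.d}"
    cat > "${cfg_dir}/99-network-every-boot.cfg" <<'EOF'
# Apply network configuration on every boot,
# not only when a new instance ID is seen.
updates:
  network:
    when: ['boot']
EOF
}

# Usage (as root, before the next reboot):
#   write_network_update_policy
```

Writing the setting into the image or a drop-in, rather than into user data, sidesteps the case where user data cannot be changed after the instance has been created.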
I know changing network ports is not really common, but is fetching the network config on each boot really that expensive / time-consuming?
To a degree, yes. Since it involves crawling multiple endpoints, it can take several seconds (I've seen up to 10) depending on the cloud. Some clouds care very much about boot time and don't want cloud-init spending that time when we don't need to.
If you want me to open another bug, I'll gladly do so. I simply did not want to add to the long list of bugs for no reason.
If there's nothing else to discuss, then we can leave this as-is. I just didn't want to hijack this bug for a related but different issue.
Launchpad user Christian Rohmann(christian-rohmann) wrote on 2023-03-09T17:25:07.692424+00:00
Thanks again for picking up on this issue. If in the end you will not change the default to "always" update the network config (which I still believe does make sense), at least adding some words to the documentation on how to enable this "manually" would be great.
This likely affects not only OpenStack: other clouds also allow interfaces to be "moved" around (detached / re-attached), which will ultimately change MAC addresses or interface order. In the current state of things a reboot will not fix this without the (thanks!) referenced config change.
Launchpad user James Falcon(falcojr) wrote on 2023-03-02T16:26:59.589707+00:00
Thanks for the context. We'll consider changing OpenStack to fetch networking per boot if that's the default OpenStack behavior.
@TheRealFalcon is this issue still being tracked, and are you still planning on changing the default at some point? While manually forcing the network to be updated on every boot (https://cloudinit.readthedocs.io/en/latest/explanation/events.html#example) works, OpenStack Nova does not (yet) allow user_data to be mutated; see https://review.opendev.org/q/topic:%22bp/update-userdata%22 for the efforts to implement this. Since the network configuration is influenced by adding or removing network ports, the only way out currently is to use the console to log in manually and adjust the netplan config, or to rebuild the VM.
I wonder if an event-driven (socket-activated service) doorbell telling cloud-init to re-crawl network configuration would be a better bet in the long term. This could be generic for use across multiple clouds, and with something like a unix socket to activate on, various container and virtual machine technologies would be able to support this.
This is really just a slightly different use case of the current hotplug code.
How do you see it being at all different? Hotplug still doesn't handle when the machine is powered off, so I think per-boot is still the way forward. Yes, we should change the default.
I imagine a hypervisor-triggered event for anytime the network configuration changes.
The upsides of this design over per-boot are:
The big and obvious downsides are:
The upsides of this design over per-boot
We still need both though. When a NIC is added while the machine is powered off, we need to know of its existence when we power it back on.
I imagine a hypervisor-triggered event for anytime the network configuration changes.
We can always ask, but I'd be shocked if we were to get anything like this. I realize that udev mechanics are a little clunkier, but what advantage to the end user do you see to this over using udev?
@TheRealFalcon @holmanb whatever you end up doing, I am really grateful you are discussing this matter of dynamic interface / network configuration!
The upsides of this design over per-boot
We still need both though. When a NIC is added while the machine is powered off, we need to know of its existence when we power it back on.
Sure, we support a huge number of clouds, of course we wouldn't convert overnight.
I imagine a hypervisor-triggered event for anytime the network configuration changes.
We can always ask, but I'd be shocked if we were to get anything like this.
That's the beauty of an event driven cloud-agnostic design. It doesn't even depend on them building anything. We can build it, document a specification which is "the way to do this with cloud-init". Cloud-init won't waste resources unless it is implemented by the cloud. Then we could implement it in (or advertise it to the devs of) the open source clouds: (LXD / MAAS / OpenStack / CloudStack / OpenNebula / etc), and then allow the clouds to make use of it if they see the value.
My understanding is that new custom cloud infrastructure features are very expensive because of the pressure that clouds feel to maintain things forever once they support them. I think that this is different because all they would have to do is wire up an event based on image type, which I would expect to be a lot less expensive than a service.
I realize that udev mechanics are a little clunkier, but what advantage to the end user do you see to this over using udev?
Udev is only relevant to interface changes currently. There are plenty of times that a user or cloud might want to modify the network configuration in a way that doesn't involve interface changes. One might want to add a public IP, update routes, etc, etc. These changes are often cloud-driven through a higher level abstraction that could trickle down to running instances much more cleanly if something like this were possible.
Imagine a point-and-click user goes to their dashboard, adds a public IP that they own to a specific instance, they see a little spinning circle for 5 seconds while the backend pokes cloud-init and waits for a response, and then they see a completion event signaling that their new IP has been configured on the instance.
What is cloud-init's current story for picking up new network configurations? Reboot your instance?
Not sure if this is related, but when I had this problem I found that it was because my default cloud-init network renderer was eni instead of netplan. The reason was that I had ifupdown installed.
apt purge ifupdown
solved it for me.
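If removing ifupdown is not an option, an alternative that may work is to pin the renderer order in cloud-init's system config so that netplan is tried first. This is a sketch under assumptions: it relies on the `system_info` / `network` / `renderers` setting from cloud-init's network configuration documentation, the drop-in file name and function name are hypothetical, and the result should be verified on the affected image.

```shell
#!/bin/sh
# Sketch only: prefer the netplan renderer over eni, even if ifupdown is installed.
# prefer_netplan_renderer and the drop-in file name are hypothetical.
# The optional first argument overrides the target directory (useful for testing).
prefer_netplan_renderer() {
    cfg_dir="${1:-/etc/cloud/cloud.cfg.d}"
    cat > "${cfg_dir}/99-prefer-netplan.cfg" <<'EOF'
system_info:
  network:
    renderers: ['netplan', 'eni']
EOF
}

# To check whether ifupdown is present on a Debian/Ubuntu system:
#   dpkg -s ifupdown >/dev/null 2>&1 && echo "ifupdown is installed"
# Usage (as root):
#   prefer_netplan_renderer
```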
Any updates about this problem?
This bug was originally filed in Launchpad as LP: #1847583
Launchpad details
Launchpad user Alexander Weber(deshke) wrote on 2019-10-10T09:40:46.736880+00:00
via https://bugs.launchpad.net/cloud-init/+bug/1846535/comments/40
steps to reproduce: (ec2/gcp/azure all have the same issue)
19.2-36-g059d049c-0ubuntu2~18.04.1
cloud-init-output.log
mounting the disk to another instance, one can see that the netplan configuration was not updated