canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init network error when using MAAS/juju #2469

Closed. ubuntu-server-builder closed this 3 months ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1345433

Launchpad details
affected_projects = ['juju-core']
assignee = smoser
assignee_name = Scott Moser
date_closed = None
date_created = 2014-07-20T05:00:02.976176+00:00
date_fix_committed = None
date_fix_released = None
id = 1345433
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1345433
milestone = None
owner = cperz
owner_name = HeinMueck
private = False
status = confirmed
submitter = cperz
submitter_name = HeinMueck
tags = ['cloud-init', 'juju', 'maas', 'network']
duplicates = []

Launchpad user HeinMueck(cperz) wrote on 2014-07-20T05:00:02.976176+00:00

I am setting up a testing environment for Trusty, MAAS and juju, all on KVMs. I provisioned 5 machines through MAAS and juju add-machine; all went fine.

When the systems reboot - all 5 of them - the network configuration fails and so does cloud-init. You get the "cloud-init-nonet waiting for 120 seconds ..." message and then the subsequent "gave up waiting ..." messages.

I added some debug output to cloud-init-nonet, hence the odd lines in the boot.log. What is interesting is that after all the waiting, you finally get a login prompt - and the network is up as it should be!

I added bridge_maxwait 0 to the br0 stanza in /etc/network/eth0.config, but that does not change anything. Upstart job logs do not show any errors.
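For reference, the stanza described would look roughly like this (a sketch; the actual eth0.config on the nodes may differ):

    iface eth0 inet manual

    auto br0
    iface br0 inet dhcp
        bridge_ports eth0
        bridge_maxwait 0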

I have tested the br0 configuration on a fresh 14.04 server installation with no cloud features and it works without any changes.

I thought it might be a firewall issue, so I removed ufw.conf as a test - the reboot behaved the same.

All packages used are the latest from the 14.04 repositories, freshly updated.

Related bugs:

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-20T05:00:02.976176+00:00

Launchpad attachments: Systems boot.log

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-07-21T14:13:43.213551+00:00

There is a lot of information missing from your bug report. I'm not sure how 'br0' comes into play at all here.

Unless you've modified something, the installed system will not expect or use a 'br0'. And if you have modified something, then you're going to need to explain what that is that you've modified.

cloud-init-nonet upstart job blocks further boot until all 'auto' network adapters in /etc/network/interfaces (including those sourced from /etc/network/interfaces.d/*) are up. That is by design and I suspect not broken.
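For illustration, a minimal /etc/network/interfaces of the kind cloud-init-nonet waits on might look like this (a sketch, not the reporter's actual file):

    # every interface marked 'auto' here must be up before boot continues
    auto lo
    iface lo inet loopback

    auto eth0
    iface eth0 inet dhcp

    # files sourced from interfaces.d are included in the same check
    source /etc/network/interfaces.d/*.cfg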

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-21T14:54:36.580897+00:00

I am confused. But maybe that's normal. Anyway, br0 gets created when you provision a bare metal machine using juju. It's the bridge for possible LXC containers you might want to address when deploying charms.

But yesterday I did not know how br0 was actually put in place. I could see the config files, but not where they came from.

I know better today: I found this in /var/lib/cloud/instances/node-fb671a3c-0a47-11e4-8074-5254000e3fd6/scripts/runcmd

------------[snip]---------------
ifdown eth0

cat > /etc/network/eth0.config << EOF
iface eth0 inet manual

auto br0
iface br0 inet dhcp
    bridge_ports eth0
EOF

sed -i "s/iface eth0 inet dhcp/source \/etc\/network\/eth0.config/" /etc/network/interfaces

ifup br0
------------[snap]---------------
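After that runcmd snippet executes, the configuration is split across two files, roughly as follows (a sketch assuming the original file contained the usual 'auto eth0' / 'iface eth0 inet dhcp' stanza):

/etc/network/interfaces:

    auto lo
    iface lo inet loopback

    auto eth0
    source /etc/network/eth0.config

/etc/network/eth0.config:

    iface eth0 inet manual

    auto br0
    iface br0 inet dhcp
        bridge_ports eth0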

This must be coming from some juju-juju.

  1. When I activate only eth0 and boot, all is fine, dhcp working
  2. When I activate the br0 setup, cloud-init-nonet waits forever and gives up, although I see the dhcp requests coming in on the server and I see later that the interfaces are up and working.

Seems to me that for some reason the upstart events needed by cloud-init-nonet to stop are lost or not fired.

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-28T03:51:02.343622+00:00

I think I found the reason.

The concept used to stop cloud-init-nonet only works if all your interfaces are physical and have a corresponding kernel object. Those interfaces are set up by the network-interface job, which emits static-network-up when done.

As soon as you have a bridge in your system, setup of virtual interfaces is handled by the networking job, and emitting static-network-up is postponed until that job has finished. cloud-init-nonet fires up before the networking job and prevents it from being started - so it waits the full time and then gives up.

System configuration continues as expected and networking ends up configured correctly - but cloud-init will not care, as /var/lib/cloud/data/no-net will exist.

I am looking for a solution now. Trying to make sure that cloud-init-nonet is fired after the networking job has been started.
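In upstart terms the idea is roughly the following (a sketch of the approach, not the actual job file or the attached patch):

    # /etc/init/cloud-init-nonet.conf (sketch)
    # also wait for the 'networking' job, which configures bridges and other
    # virtual interfaces, so that static-network-up can still be emitted
    start on mounted MOUNTPOINT=/ and started networking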

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-28T05:21:12.010031+00:00

Just a question: why not use the static-network-up event to start the network-dependent parts of cloud-init, instead of stalling the complete boot process while waiting for it? I'm sure you have had this discussion, but I could not find it.

I would guess that some actions of cloud-init rely on the no-net file to figure out whether they can run or not. Turn it into a has-network flag instead.

Stalling seems a bit against the whole concept of upstart.

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-31T13:15:20.390060+00:00

Here is a patch that is now working for me. Edited some old copy-paste comments too.

It works for systems with physical interfaces and virtual interfaces defined in /etc/network/interfaces. I have looked into network-manager, but I don't see a way to make this work without actually extending all the network scripts to fit the current cloud-init architecture. But I may well be wrong here.

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-07-31T13:16:11.527012+00:00

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-08-01T20:15:45.501459+00:00

We block boot by design. If we didn't block boot, then there would be no way to guarantee consumption of user-data would take place at a defined point during boot.

As it is done right now, you can be assured that modifying just about any service or config file during a boothook will have the same effect as doing so before you booted the instance at all.
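For example, a user-data boothook of this kind (a generic sketch, not tied to this bug) runs early on every boot, before the services it touches are started:

    #cloud-boothook
    #!/bin/sh
    # tweak a daemon's config before that daemon ever starts; the effect is
    # the same as editing the file before the instance booted
    sed -i 's/^#\?MaxSessions .*/MaxSessions 4/' /etc/ssh/sshd_config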

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-08-01T20:19:48.177374+00:00

if you're posting a patch, please do so in 'diff -u' format. the default output of diff is pointlessly useless (not your fault, probably 'diff -u' should be its default).
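For example (hypothetical file names):

    # unified diff of the edited upstart job against a pristine copy
    diff -u cloud-init-nonet.conf.orig cloud-init-nonet.conf > cloud-init-nonet.patch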

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-08-02T00:02:53.049808+00:00

Launchpad attachments: Patch for upstart jobs

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-09-08T20:52:26.804977+00:00

attaching a doc on how to attempt a reproduce. I've not been successful yet though. David reports probably 1/1000 instances hit this.

Launchpad attachments: doc on how to attempt reproduce

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-09-09T00:54:44.087560+00:00

Scott,

are you sure you are commenting on the right bug? It looks like your reproduce attempt was meant for another problem.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-09-09T18:46:50.310567+00:00

yes, this is the right bug. the reproduce shows how to do this and demonstrates the bug that they're hitting outside of juju, using the tools that juju uses.

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-09-09T22:50:10.875918+00:00

Then let me show you how I reproduced the bug I wrote this report for:

  1. Add the following to /etc/network/interfaces on any Ubuntu system using cloud-init:

     auto br0
     iface br0 inet dhcp
         bridge_ports eth0

  2. Reboot.

The failure rate is 100%. Maybe the referenced bugs made the original problem look a bit bigger than it actually is. The patch tries to fix this one by also taking virtual networking into the equation.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-09-10T13:51:38.476412+00:00

i was wrong. wrong bug. sorry for confusion.

ubuntu-server-builder commented 1 year ago

Launchpad user Janghoon-Paul Sim(janghoon) wrote on 2014-09-26T09:58:23.086115+00:00

I have the same issue. It only happens on Trusty deployed by juju with the MAAS provider. On Precise, cloud-init brings up br0 correctly.

One more related bug: https://bugs.launchpad.net/juju-core/+bug/1271144

ubuntu-server-builder commented 1 year ago

Launchpad user gadLinux(gad-aguilardelgado) wrote on 2014-10-11T23:53:32.473947+00:00

This happens to me also. Let me clarify things a little bit. From my point of view:

When you provision a server with MAAS, it expects all networks to be physical. But in my case I have to deploy servers to an OpenStack network as compute nodes, so I have to bridge all interfaces with Open vSwitch. This is my choice, and I did it this way.

ovs-vsctl show
8d08d8e4-49f2-4243-b1db-7641984a8530
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "phy-br0"
            Interface "phy-br0"
    Bridge br-ext
        Port br-ext
            Interface br-ext
                type: internal
    Bridge br-int
        fail_mode: secure
        Port "int-br0"
            Interface "int-br0"
        Port "qvoea759488-64"
            tag: 1
            Interface "qvoea759488-64"
        Port "qvod1b2e6dc-7f"
            tag: 1
            Interface "qvod1b2e6dc-7f"
        Port "eth1"
            Interface "eth1"
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port "qvoc0571326-4c"
            tag: 1
            Interface "qvoc0571326-4c"
    ovs_version: "2.0.2"

In my case br-int communicates with the internal integration network. Where MAAS is available for me.

br-int    Link encap:Ethernet  HWaddr 68:05:ca:1a:09:50
          inet addr:172.16.0.100  Bcast:172.16.31.255  Mask:255.255.0.0
          inet6 addr: fe80::e46e:23ff:fe3c:c279/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:577447 errors:0 dropped:0 overruns:0 frame:0
          TX packets:724745 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3114547329 (3.1 GB)  TX bytes:908737979 (908.7 MB)

MAAS is 172.16.0.40.

What I think happens is that Ubuntu brings up all the interfaces except the Open vSwitch ones. Then it checks whether the network is up, but it is not, because the bridges and the rest of the configuration are not there yet - the Open vSwitch configuration has not started. So it waits for the network interfaces and finally fails.

After a while of waiting, it starts the Open vSwitch stuff and brings the network up, but cloud-init has already failed. In my case it is even worse, because I depend on Ceph, which starts after Open vSwitch, so it also fails and stays failed - no recovery. I have to fix it manually after the machine boots.

I think the solution is to bring up the Open vSwitch configuration together with the rest of the system. I read about it, but it seems I cannot put that configuration in /etc/network/interfaces since it does not work, so I'm sticking with the ovsdb configuration.

What do you think? Can this be the problem?

ubuntu-server-builder commented 1 year ago

Launchpad user gadLinux(gad-aguilardelgado) wrote on 2014-10-12T00:02:49.272986+00:00

Just for reference http://askubuntu.com/questions/535988/cloud-init-nonet

ubuntu-server-builder commented 1 year ago

Launchpad user Alex Tomkins(tomkins) wrote on 2014-10-25T14:55:13.911759+00:00

I'm having annoying issues with this exact problem: every boot takes an additional 130 seconds because I have a bridge interface configured.

Comments #4 and #14 describe this perfectly. Add a bridge interface: because bridge interfaces aren't physical, no net-device-added event is ever emitted for them - the bridge is configured by networking.conf instead, but that is too late.

ubuntu-server-builder commented 1 year ago

Launchpad user HeinMueck(cperz) wrote on 2014-10-25T23:39:19.725675+00:00

@gadLinux - as Alex says, comments 4 and 14 have the reason, comment 10 the patch. It hits you as soon as you have more than just ethX interfaces in your system.

What bugs me is not that it takes some more time to boot, but that cloud-init thinks there is no network and thus will not run - will not do its job. Umpf.

And of course, if you run into a problem like that during boot, you end up declaring the status of your machine unknown, because you never know which package installed later also fails, maybe silently. You are likely to have more problems than just cloud-init not running. Pretty deadly for a serious production environment.

I am surprised that this bug does not get much attention.

ubuntu-server-builder commented 1 year ago

Launchpad user Alex Tomkins(tomkins) wrote on 2014-10-27T15:27:51.943363+00:00

Old-style interface aliases will also cause the same problem, e.g. eth0:1 - another interface which isn't physical, so no signal gets emitted for it.
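For example, an alias stanza like this (a generic sketch with documentation addresses) triggers the same wait, since eth0:1 never gets its own net-device-added event:

    auto eth0
    iface eth0 inet dhcp

    auto eth0:1
    iface eth0:1 inet static
        address 192.0.2.10
        netmask 255.255.255.0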

ubuntu-server-builder commented 1 year ago

Launchpad user Dimiter Naydenov(dimitern) wrote on 2014-11-07T10:57:03.204863+00:00

With this change that landed on juju-core trunk: https://github.com/juju/juju/pull/1046 and was also backported to the 1.21 and 1.20 branches, the issues with LXC/KVM containers that got stuck in "pending" state or failed to start due to incorrect networking (bridge name, interface to use, etc.) configuration should be fixed. The bridge name is now "juju-br0" for both LXC and KVM containers on MAAS.

ubuntu-server-builder commented 1 year ago

Launchpad user Larry Michel(lmic) wrote on 2014-11-28T16:43:23.635335+00:00

I think we are hitting an issue associated with this fix. Please see https://bugs.launchpad.net/juju-core/+bug/1395908.

The problem I am seeing is the juju-br0 bridge being configured with eth2:

auto lo

iface eth3 inet dhcp
iface eth4 inet dhcp
iface eth5 inet dhcp

auto eth0
iface eth0 inet dhcp

iface eth1 inet dhcp

iface eth2 inet manual

auto juju-br0
iface juju-br0 inet dhcp
    bridge_ports eth2

The problem is that we only have eth0 connected and eth2 is disconnected. How does it determine which interface to use for juju-br0?

ubuntu-server-builder commented 1 year ago

Launchpad user Larry Michel(lmic) wrote on 2014-11-28T20:57:45.053789+00:00

This looks to be re-creatable on servers that have 2 types of network card. It's looking like a regression.

ubuntu-server-builder commented 1 year ago

Launchpad user Larry Michel(lmic) wrote on 2014-11-28T21:00:35.534344+00:00

On servers with one type of network card, we have no issue. I have one server where the 2nd set of cards is not configured as eth2..ethX and juju-br0 still gets configured with eth2. I have added the details to bug 1395908.

ubuntu-server-builder commented 1 year ago

Launchpad user Dimiter Naydenov(dimitern) wrote on 2014-12-01T07:44:08.453049+00:00

@Larry,

Can you please attach the lshw XML output stored in your MAAS for those machines with 2 types of cards (eth0, eth2)?

I suspect the machines originally had a single interface and were commissioned OK by MAAS, then their hardware changed (i.e. an extra NIC was added or the NIC ordering was rearranged) without recommissioning them. Since the juju-br0 fix, juju tries to discover the primary NIC on the machine (the one to plug into juju-br0) using the lshw XML dump of the machine. There is a bug #1390404 I filed against MAAS that will improve this and allow us to discover what NICs are there via the MAAS API rather than with such workarounds.

ubuntu-server-builder commented 1 year ago

Launchpad user Dimiter Naydenov(dimitern) wrote on 2014-12-02T06:55:28.892602+00:00

After analysing the issue I can confirm that bug #1395908 describes an unusual case where most of the network interfaces on the nodes MAAS provisions for Juju are disabled. The juju-br0 fix does not take disabled="true" on node elements into account when parsing the lshw XML dump discovered during commissioning.

There will be a fix for this issue with backports for 1.20.14 and 1.21.
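For context, a disabled NIC shows up in the lshw XML dump roughly like this (a schematic sketch, not the actual dump from these machines):

    <node id="network:1" claimed="true" class="network" handle="PCI:0000:05:00.1" disabled="true">
      <description>Ethernet interface</description>
      <logicalname>eth2</logicalname>
      <serial>00:11:22:33:44:55</serial>
    </node>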

ubuntu-server-builder commented 1 year ago

Launchpad user gadLinux(gad-aguilardelgado) wrote on 2014-12-14T10:59:59.879132+00:00

Hi @cperz and @tomkins,

I must say I spent my morning compiling and patching the cloud-init 0.7.5 revision with the patch from comment #10, and the system behaves the same.

It may be because I'm using OVS (Open vSwitch), but I see no improvement.

Every - I say EVERY - server I build to integrate with OpenStack has this problem. I'm evaluating dropping MAAS, since the cloud-init side of things seems to be the cause of all the headache.

And indeed, it is incredible that nobody notices this - maybe because nobody uses MAAS+OpenStack for deployment.

This bug is marked as Importance: Medium, but as you said, @cperz, it leaves all my other daemons unconfigured or bound to the wrong devices because init did not do its job right. Yes, after a wait of more than 120 seconds it starts and brings up the interfaces, because all the Open vSwitch stuff runs after the timeout. But it leaves everything that requires the network in a stale state that makes the daemons unable to connect to the network.

I have to start everything by hand and solve problems manually.

And all because nobody cared about systems with more than one interface and some bridges. Unbelievable!

My cluster is small and I can deal with it, but what do others do?

Please increase the importance of this bug, since it's not solved.

Thank you a lot in advance.

ubuntu-server-builder commented 1 year ago

Launchpad user gadLinux(gad-aguilardelgado) wrote on 2014-12-14T11:48:17.358585+00:00

In fact the servers take more than 120 seconds.

It can be about 5-10 minutes, depending on the number of bridges. It's really a mess.

ubuntu-server-builder commented 1 year ago

Launchpad user james beedy(jamesbeedy) wrote on 2015-04-08T01:26:04.168299+00:00

I am affected by some variant of this bug. When I provision nodes with multiple interfaces and eth0 is not used, I experience a slew - no, wait, a plethora - of different bugs revolving around this, too many to accurately write up from my phone. I will give a complete write-up tonight or tomorrow. Bottom line: can we get this fixed ASAP?

ubuntu-server-builder commented 1 year ago

Launchpad user gadLinux(gad-aguilardelgado) wrote on 2015-08-22T17:52:41.088914+00:00

Today I did a lot of testing. The situation is as follows.

If you have a bridge that is brought up by Open vSwitch and you tell /etc/network/interfaces that this bridge should get an IP address via DHCP once it is created, then cloud-init will wait for it. Since Open vSwitch is one of the last things to be activated, init hangs until cloud-init gives up.

ubuntu-server-builder commented 1 year ago

Launchpad user Karan(karan-pugla) wrote on 2016-02-15T00:41:55.896483+00:00

This is how I made it work with Open vSwitch bridges on OpenStack compute nodes. cloud-init-nonet blocks all other services until it receives the 'static-network-up' event. 'static-network-up' is emitted only when all the interfaces configured as 'auto' in /etc/network/interfaces are up; the script responsible for emitting it is '/etc/network/if-up.d/upstart'. Since the openvswitch service is blocked while cloud-init-nonet is active, bridges created by Open vSwitch that are also marked 'auto' in /etc/network/interfaces will not come up, and 'static-network-up' will never be emitted.

So the solution is to use the interfaces subsystem's allow (hotplug) classes: instead of declaring the bridge as auto, declare it as allow-ovs:

BEFORE:

auto eth0
iface eth0 inet manual

auto br-ex
iface br-ex inet static
    pre-up ifup eth0
    gateway 10.10.0.6
    address 10.10.208.135/16
    mtu 1500
    post-down ifdown eth0

AFTER:

auto eth0
iface eth0 inet manual

allow-ovs br-ex
iface br-ex inet static
    pre-up ifup eth0
    gateway 10.10.0.6
    address 10.10.208.135/16
    mtu 1500
    post-down ifdown eth0

This brings up all the devices on boot and statically assigns the IPs. It works because the Open vSwitch init service specifically brings up, on boot, the interfaces declared with the 'ovs' allow class.
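A rough way to check which class an interface ends up in on Trusty's ifupdown (a sketch assuming the stock ifquery/ifup tools; br-ex is the bridge from the example above):

    # interfaces in the 'auto' class - the ones cloud-init-nonet waits for
    ifquery --list --allow=auto

    # interfaces in the 'ovs' class - picked up later by the Open vSwitch init script
    ifquery --list --allow=ovs
    ifup --allow=ovs br-ex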

holmanb commented 3 months ago

The networking code from 2014 and 2024 have very little in common. Ubuntu now uses systemd. Cloud-init now uses Python 3. I'm sure that much has changed in the other technologies mentioned here.

I'm closing, since this bug seems pretty irrelevant to the cloud-init of today. Please file a new bug if you experience something similar to this.