canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

ppc64el / arm64 - issues with cloud-init setting default route #3690

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1879933

Launchpad details
affected_projects = ['maas', 'netplan']
assignee = None
assignee_name = None
date_closed = None
date_created = 2020-05-21T10:52:04.137858+00:00
date_fix_committed = 2020-06-05T14:39:45.303506+00:00
date_fix_released = 2020-06-05T14:39:45.303506+00:00
id = 1879933
importance = undecided
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1879933
milestone = None
owner = admcleod
owner_name = Andrew McLeod
private = False
status = incomplete
submitter = admcleod
submitter_name = Andrew McLeod
tags = []
duplicates = []

Launchpad user Andrew McLeod(admcleod) wrote on 2020-05-21T10:52:04.137858+00:00

This is quite possibly a cloud-init bug.

MAAS version: 2.6.2 (7841-ga10625be3-0ubuntu1~18.04.1)

This problem manifests whether the machine is deployed with juju or manually via the MAAS UI.

This problem is intermittent and I have only seen it affecting arm64 and ppc64el machines (out of 29 machines in total) - all of these machines have 2 interfaces connected to the same fabric in the same subnet - one is set to unassigned to be used as a bridge port / data port for openstack deployments, the other is set to auto assign.

This problem occurs with bionic, eoan and focal deployments.

I have recommissioned the affected machines numerous times, including attempts to update firmware.

Symptoms: when the machine comes up after it is deployed there is no default gateway, e.g.

ubuntu@node-mawhile:/var/log$ ip route
10.245.168.0/21 dev enP5p9s0f0 proto kernel scope link src 10.245.168.63

The rsyslog on the MAAS server shows that the machine is being configured correctly:

https://pastebin.ubuntu.com/p/ZZzQ4q2ZCT/

But the cloud-init log on the machine does not have a default gateway:

https://pastebin.ubuntu.com/p/cCJbF7zhtK/

Additional info:

Something I have observed is that the machines where this problem occurs seem to sometimes have the 'unassigned' interface as the PXE interface, and sometimes the auto-assigned interface. I've tried to force this but the PXE interface moves around by itself.

ubuntu-server-builder commented 1 year ago

Launchpad user Lee Trager(ltrager) wrote on 2020-05-21T21:06:52.935283+00:00

MAAS passes network config to cloud-init which writes it to /etc/netplan/50-cloud-init.yaml and uses netplan to actually apply it once the system has booted. netplan is non-blocking and I've seen cloud-init output incomplete network information even though netplan hasn't finished applying network config.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-05-21T22:11:42.455894+00:00

netplan is non-blocking and I've seen cloud-init output incomplete network information even though netplan hasn't finished applying network config

cloud-init calls netplan generate which reads the config passed in from MAAS, and writes out all of the networkd files per the config; this happens before network-online.target is reached, so systemd-networkd runs and cloud-init will not proceed until systemd-networkd-wait-online.service is complete;

systemd-networkd-wait-online.service will wait for all interfaces which have configuration on them.

From the config posted, there's no config for eno1, so this appears to be output from one system and input config from a different one. Can you provide the failing output, along with the /etc/netplan/50-cloud-init.yaml and /etc/cloud/cloud.cfg.d/50-curtin-networking.cfg files?

cloud-init log on the machine does not have a default gateway

0 | 0.0.0.0 | 10.245.168.1 | 0.0.0.0 | eno1 | UG |

Is this not the default gateway?

And lastly, if your config is using non-standard routing tables like the paste you supplied, ip route will only show routes in the default table, and the default route appears to be in table 1.

routes:

I took the config from your paste and put it in a container, then ran netplan apply:

root@g1:~# netplan --debug apply
(generate:5092): DEBUG: 21:55:25.895: Processing input file /etc/netplan/50-cloud-init.yaml..
(generate:5092): DEBUG: 21:55:25.895: starting new processing pass
(generate:5092): DEBUG: 21:55:25.895: We have some netdefs, pass them through a final round of validation
(generate:5092): DEBUG: 21:55:25.895: eth0: setting default backend to 1
(generate:5092): DEBUG: 21:55:25.895: Configuration is valid
(generate:5092): DEBUG: 21:55:25.895: Generating output files..
** (generate:5092): DEBUG: 21:55:25.895: NetworkManager: definition eth0 is not for us (backend 1)
(generate:5092): GLib-DEBUG: 21:55:25.895: posix_spawn avoided (fd close requested)
DEBUG:netplan generated networkd configuration changed, restarting networkd
DEBUG:no netplan generated NM configuration exists
DEBUG:eth0 not found in {}
DEBUG:Merged config:
network:
  bonds: {}
  bridges: {}
  ethernets:
    eth0:
      addresses:

DEBUG:Skipping non-physical interface: lo
DEBUG:{}
DEBUG:netplan triggering .link rules for lo
DEBUG:netplan triggering .link rules for eth0
root@g1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:39:6c:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.245.168.63/21 brd 10.245.175.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fe39:6cf7/64 scope link tentative
       valid_lft forever preferred_lft forever
root@g1:~# ip route
10.245.168.0/21 dev eth0 proto kernel scope link src 10.245.168.63

root@g1:~# ip route show table 1
default via 10.245.168.1 dev eth0 proto static

root@g1:~# ip route show table 254
10.245.168.0/21 dev eth0 proto kernel scope link src 10.245.168.63

And to replicate the cloud-init output:

root@g1:~# python3 -c 'import sys; from cloudinit import netinfo; sys.stderr.write("%s\n" % (netinfo.debug_info()))'
ci-info: +++++++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-------------------------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |                  Address                  |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+-------------------------------------------+---------------+--------+-------------------+
ci-info: |  eth0  | True |               10.245.168.63               | 255.255.248.0 | global | 00:16:3e:39:6c:f7 |
ci-info: |  eth0  | True | fd42:f890:56f5:dcfb:216:3eff:fe39:6cf7/64 |       .       | global | 00:16:3e:39:6c:f7 |
ci-info: |  eth0  | True |        fe80::216:3eff:fe39:6cf7/64        |       .       |  link  | 00:16:3e:39:6c:f7 |
ci-info: |   lo   | True |                 127.0.0.1                 |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |                  ::1/128                  |       .       |  host  |         .         |
ci-info: +--------+------+-------------------------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++
ci-info: +-------+--------------+---------+---------------+-----------+-------+
ci-info: | Route | Destination  | Gateway |    Genmask    | Interface | Flags |
ci-info: +-------+--------------+---------+---------------+-----------+-------+
ci-info: |   0   | 10.245.168.0 | 0.0.0.0 | 255.255.248.0 |    eth0   |   U   |
ci-info: +-------+--------------+---------+---------------+-----------+-------+
ci-info: ++++++++++++++++++++++++++++++++++Route IPv6 info+++++++++++++++++++++++++++++++++++
ci-info: +-------+--------------------------+---------------------------+-----------+-------+
ci-info: | Route |       Destination        |          Gateway          | Interface | Flags |
ci-info: +-------+--------------------------+---------------------------+-----------+-------+
ci-info: |   0   | fd42:f890:56f5:dcfb::/64 |             ::            |    eth0   |   Ue  |
ci-info: |   1   |        fe80::/64         |             ::            |    eth0   |   U   |
ci-info: |   2   |           ::/0           | fe80::b42b:1cff:fed1:3998 |    eth0   |  UGe  |
ci-info: |   4   |          local           |             ::            |    eth0   |   U   |
ci-info: |   5   |          local           |             ::            |    eth0   |   U   |
ci-info: |   6   |         ff00::/8         |             ::            |    eth0   |   U   |
ci-info: +-------+--------------------------+---------------------------+-----------+-------+

I think this matches up to your output, one does not "see" the default route as it's in table 1. So this is expected behavior as cloud-init's netinfo dumps information from 'ip route' output.
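To make the point above concrete, here is a minimal Python sketch that scans the text output of `ip route show table all` and groups default routes by routing table. It is an illustration only, not part of cloud-init: the function name `default_routes_by_table` is hypothetical, and the sample text is hand-assembled from the routes shown in this thread rather than captured from the affected machine. It relies on iproute2 tagging routes outside the main table with a `table <id>` field.

```python
from typing import Dict, List


def default_routes_by_table(ip_route_output: str) -> Dict[str, List[str]]:
    """Group default routes by the routing table they live in."""
    found: Dict[str, List[str]] = {}
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        # Routes in the main table carry no "table" field in this output.
        table = "main"
        if "table" in fields:
            idx = fields.index("table")
            if idx + 1 < len(fields):
                table = fields[idx + 1]
        found.setdefault(table, []).append(line.strip())
    return found


# Hand-written sample in the style of `ip route show table all` (assumption).
SAMPLE = """\
default via 10.245.168.1 dev eth0 table 1 proto static
10.245.168.0/21 dev eth0 proto kernel scope link src 10.245.168.63
local 10.245.168.63 dev eth0 table local proto kernel scope host src 10.245.168.63
"""

if __name__ == "__main__":
    print(default_routes_by_table(SAMPLE))
```

A result with no `main` key means a plain `ip route` (and therefore the ci-info route table) will not list a default route, even though one exists elsewhere.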

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-05-21T22:12:19.664424+00:00

I believe the cloud-init task is invalid, but let's wait for some more information from submitter.

ubuntu-server-builder commented 1 year ago

Launchpad user Andrew McLeod(admcleod) wrote on 2020-06-04T14:14:47.181366+00:00

Perhaps the bug title should be changed from 'no default route' to 'default route doesn't seem to be relevant if it is not in the default routing table':

I wasn't aware that the default route would be in another table - it is, but it doesn't work. If I add the route to the default table it does work.

ubuntu@node-mawhile:~$ ip route
10.140.121.0/24 dev lxdbr0 proto kernel scope link src 10.140.121.1
10.245.168.0/21 dev enP5p9s0f0 proto kernel scope link src 10.245.168.63
ubuntu@node-mawhile:~$ ip route list table 1
default via 10.245.168.1 dev enP5p9s0f0 proto static

ubuntu@node-mawhile:~$ ip route get 8.8.8.8
RTNETLINK answers: Network is unreachable
ubuntu@node-mawhile:~$ sudo ip route add default via 10.245.168.1 dev enP5p9s0f0
ubuntu@node-mawhile:~$ ip route get 8.8.8.8
8.8.8.8 via 10.245.168.1 dev enP5p9s0f0 src 10.245.168.63 uid 1000
    cache

pastebin for /etc/netplan/50-cloud-init.yaml https://pastebin.ubuntu.com/p/DbHvkCHtNr/

pastebin for /etc/cloud/cloud.cfg.d/50-curtin-networking.cfg https://pastebin.ubuntu.com/p/KH4R2XMTCN/

Here is the replicated cloud-init output (note the lxd route is there because I launched some containers after adding the route) - note no default route in this output.

https://pastebin.ubuntu.com/p/XtwmcVZxV3/

Running netplan apply --debug doesn't make a difference: the default route is still where it should be, in table 1, but nothing external is reachable.

ubuntu-server-builder commented 1 year ago

Launchpad user Paride Legovini(paride) wrote on 2020-06-05T14:39:37.383809+00:00

Hi,

now my question is: isn't the fact that non-default routing tables are not used by default the expected behavior? IIUC non-default tables need rules to configure when they should be used, e.g.

ip rule add from <address> table <table>
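As a rough way to check for this condition in a rendered config, the sketch below walks an already-parsed netplan document (a plain Python dict) and reports routes that are placed in a numbered table for which no `routing-policy` entry exists anywhere in the config. The key names `routes`, `table` and `routing-policy` come from netplan's YAML schema. The helper name `tables_without_policy` and the example config are assumptions for illustration: the example reuses addresses from this thread but is not the content of the linked pastebins.

```python
def tables_without_policy(netplan_config):
    """Return {interface: [table ids]} for routes whose table no rule selects."""
    sections = ("ethernets", "bonds", "bridges", "vlans")
    network = netplan_config.get("network", {})

    # Policy rules are global, so collect the tables they reference first.
    policy_tables = set()
    for section in sections:
        for iface in (network.get(section) or {}).values():
            for rule in iface.get("routing-policy", []):
                if "table" in rule:
                    policy_tables.add(rule["table"])

    missing = {}
    for section in sections:
        for name, iface in (network.get(section) or {}).items():
            route_tables = {r["table"] for r in iface.get("routes", []) if "table" in r}
            orphaned = route_tables - policy_tables
            if orphaned:
                missing[name] = sorted(orphaned)
    return missing


# Hypothetical config: a default route in table 1 and no routing-policy.
EXAMPLE = {
    "network": {
        "version": 2,
        "ethernets": {
            "eth0": {
                "addresses": ["10.245.168.63/21"],
                "routes": [{"to": "0.0.0.0/0", "via": "10.245.168.1", "table": 1}],
            }
        },
    }
}

if __name__ == "__main__":
    print(tables_without_policy(EXAMPLE))
```

An empty result does not prove routing works; it only means every table used by a route is referenced by at least one policy rule.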

Also, you wrote in the bug description that the problem is intermittent. I think it would be really interesting to see what the config files look like and how routing is configured when everything does happen to work. Do you think you can collect the relevant logs?

Thanks!

ubuntu-server-builder commented 1 year ago

Launchpad user Andrew McLeod(admcleod) wrote on 2020-06-08T15:23:47.338344+00:00

It took about 12 deploys - I did nothing but release/deploy (focal) - and I managed to get one that had a functional network:

ubuntu@node-gengar:~$ ip route get 8.8.8.8
8.8.8.8 via 10.245.168.1 dev enP5p9s0f1 src 10.245.168.27 uid 1000
    cache

ubuntu@node-gengar:~$ ip route
default via 10.245.168.1 dev enP5p9s0f1 proto static
10.245.168.0/21 dev enP5p9s0f1 proto kernel scope link src 10.245.168.27
ubuntu@node-gengar:~$ ip route list table 1
Error: ipv4: FIB table does not exist.
Dump terminated
ubuntu@node-gengar:~$ ip rule list
0:	from all lookup local
32766:	from all lookup main
32767:	from all lookup default
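The difference between the working and broken machines can be modelled with a small Python sketch of policy-routing lookup: rules are tried in priority order, and a table is only consulted if some rule points at it. This is a deliberately simplified model (every rule is "from all", and the only route selection is longest-prefix match), not the kernel's actual algorithm. The rule list is the one printed by `ip rule list` above; the broken machine's rule list was not posted, so assuming it has the same three rules is a guess. The names `lookup`, `RULES`, `BROKEN` and `WORKING` are hypothetical.

```python
import ipaddress


def lookup(dest, rules, tables):
    """Return (table, route) for the first table with a matching route, else None.

    rules:  list of (priority, table_name), all treated as "from all".
    tables: {table_name: [(prefix, description), ...]}
    """
    addr = ipaddress.ip_address(dest)
    for _priority, table in sorted(rules):
        matches = [
            (ipaddress.ip_network(prefix), desc)
            for prefix, desc in tables.get(table, [])
            if addr in ipaddress.ip_network(prefix)
        ]
        if matches:
            # Within one table, the most specific prefix wins.
            _net, desc = max(matches, key=lambda m: m[0].prefixlen)
            return table, desc
    return None


# The three rules printed by `ip rule list` on node-gengar.
RULES = [(0, "local"), (32766, "main"), (32767, "default")]

# Routes as reported on node-mawhile (broken) and node-gengar (working).
BROKEN = {
    "main": [("10.245.168.0/21", "dev enP5p9s0f0 scope link")],
    "1": [("0.0.0.0/0", "via 10.245.168.1 dev enP5p9s0f0")],
}
WORKING = {
    "main": [
        ("0.0.0.0/0", "via 10.245.168.1 dev enP5p9s0f1"),
        ("10.245.168.0/21", "dev enP5p9s0f1 scope link"),
    ],
}

if __name__ == "__main__":
    print(lookup("8.8.8.8", RULES, BROKEN))
    print(lookup("8.8.8.8", RULES, WORKING))
    # A hypothetical extra rule pointing at table 1 makes its default route usable.
    print(lookup("8.8.8.8", RULES + [(100, "1")], BROKEN))
```

In this model the broken layout yields no route for 8.8.8.8 because no rule ever selects table 1, which is consistent with the "Network is unreachable" answer reported earlier in the thread.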

/etc/netplan/50-cloud-init.yaml https://pastebin.ubuntu.com/p/qxFJCSkyfn/

/etc/cloud/cloud.cfg.d/50-curtin-networking.cfg https://pastebin.ubuntu.com/p/zdDwgVbSJd/

cloud-init https://pastebin.ubuntu.com/p/pySk8r6Cp3/

I'm going to leave this one up and a broken one in case anyone wants any other logs etc.

ubuntu-server-builder commented 1 year ago

Launchpad user Andrew McLeod(admcleod) wrote on 2020-07-14T09:30:21.213738+00:00

Is there anything else I can add to this to help?