canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.88k stars 857 forks source link

No support for classless-static-routes on centos 7 #3487

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1850642

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2019-12-31T12:13:46.555404+00:00
date_created = 2019-10-30T13:54:05.635026+00:00
date_fix_committed = 2019-12-31T12:13:46.555404+00:00
date_fix_released = 2019-12-31T12:13:46.555404+00:00
id = 1850642
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1850642
milestone = None
owner = hkominos
owner_name = Harry Kominos
private = False
status = fix_released
submitter = hkominos
submitter_name = Harry Kominos
tags = []
duplicates = []

Launchpad user Harry Kominos(hkominos) wrote on 2019-10-30T13:54:05.635026+00:00

In a tripleo Rocky deployment I am seeing the behaviour bellow.

In the bootup logs I see that the metadata service could not be reached.

File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceOpenStack.py", line 177, in _crawl_metadata 'No active metadata service found')

However the service is curlable from inside the VM and I am also seeing some requests in the metadata-service in Openstack. Furthemore I am getting SSH keys inserted so I believe this might be a false warning.

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-10-30T13:54:05.635026+00:00

Launchpad attachments: cloud-init.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T14:33:47.067215+00:00

I think this is a duplicate of bug 1801364.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T14:34:35.861414+00:00

un-dupe this if you think my diagnosis is wrong.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T14:46:36.669565+00:00

un-marked as a duplicate. cloud-init WARNed twice in this log. a.) openstack local datasource failed to read metadata service (timed out) b.) openstack network datasource worked, but failed to persisting instance-data.json. This is bug 1801364 I'm pretty sure.

I think the reason for 'a' is the fact that the dhcp lease gave an explicit route to 169.254.169.254, but cloud-init doesn't know how to read that in its EphemeralDhcp network. See:

+++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++++ +-------+-----------------+----------------+-----------------+-----------+-------+ | Route | Destination | Gateway | Genmask | Interface | Flags | +-------+-----------------+----------------+-----------------+-----------+-------+ | 0 | 0.0.0.0 | 136.156.91.254 | 0.0.0.0 | eth0 | UG | | 1 | 136.156.90.0 | 0.0.0.0 | 255.255.254.0 | eth0 | U | | 2 | 169.254.169.254 | 136.156.90.11 | 255.255.255.255 | eth0 | UGH | +-------+-----------------+----------------+-----------------+-----------+-------+

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T15:02:47.337915+00:00

The logs do contain that bug, but I'm not sure that's the failure here.

Looking at the cloud-init logs, we can see the Ephemeral DHCP start and obtain an lease, but the end point does not respond:

2019-10-30 13:28:41,516 - dhcp.py[DEBUG]: Received dhcp lease on eth0 for 136.156.90.74/255.255.254.0 2019-10-30 13:28:41,516 - init.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 136.156.90.74/23 brd 136.156.91.255 2019-10-30 13:28:41,516 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '136.156.90.74/23', 'broadcast', '136.156.91.255', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:28:41,519 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:28:41,522 - util.py[DEBUG]: Running command ['ip', 'route', 'show', '0.0.0.0/0'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:28:41,525 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '136.156.91.254', 'dev', 'eth0', 'src', '136.156.90.74'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:28:41,527 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '136.156.91.254', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:29:21,573 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.044 seconds 2019-10-30 13:29:21,574 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'headers': {'User-Agent': 'Cloud-Init/18.5'}, 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0} configuration 2019-10-30 13:29:31,608 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [10/-1s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /openstack (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdc5e5dc210>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))] 2019-10-30 13:29:31,608 - DataSourceOpenStack.py[DEBUG]: Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 10 seconds 2019-10-30 13:29:31,609 - util.py[DEBUG]: Crawl of metadata service took 50.079 seconds

Cloud-init did not find the OpenStack datasource at local time:

2019-10-30 13:29:31,689 - main.py[DEBUG]: No local datasource found

So we write out a fallback network config (dhcp on eth0).

2019-10-30 13:29:31,722 - stages.py[INFO]: Applying network configuration from fallback bringup=False: {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'eth0', 'mac_address': 'fa:16:3e:f4:b5:1d'}]} 2019-10-30 13:29:31,723 - init.py[DEBUG]: Selected renderer 'sysconfig' from priority list: None 2019-10-30 13:29:31,725 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-eth0 - wb: [644] 159 bytes 2019-10-30 13:29:31,726 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-eth0 (recursive=False) 2019-10-30 13:29:31,726 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-eth0 (recursive=False)

And then exit cloud-init-local.service:

2019-10-30 13:29:31,730 - util.py[DEBUG]: cloud-init mode 'init' took 52.389 seconds (52.39)

Now, the OS networking layer comes up:

Oct 30 13:29:31.623535 localhost.localdomain cloud-init[732]: 2019-10-30 13:29:31,619 - util.py[WARNING]: No active metadata service found Oct 30 13:29:31.751295 localhost.localdomain systemd[1]: Started Initial cloud-init job (pre-networking). Oct 30 13:29:31.754472 localhost.localdomain systemd[1]: Reached target Network (Pre). Oct 30 13:29:31.756812 localhost.localdomain systemd[1]: Starting LSB: Bring up/down networking... Oct 30 13:29:31.961050 localhost.localdomain network[866]: Bringing up loopback interface: [ OK ] Oct 30 13:29:31.990878 localhost.localdomain network[866]: Bringing up interface eth0: Oct 30 13:29:32.034411 localhost.localdomain dhclient[995]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0xeceb89e) Oct 30 13:29:32.055157 localhost.localdomain dhclient[995]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xeceb89e) Oct 30 13:29:32.055188 localhost.localdomain dhclient[995]: DHCPOFFER from 136.156.90.11 Oct 30 13:29:32.055796 localhost.localdomain dhclient[995]: DHCPACK from 136.156.90.11 (xid=0xeceb89e) Oct 30 13:29:34.132189 localhost.localdomain NET[1054]: /usr/sbin/dhclient-script : updated /etc/resolv.conf Oct 30 13:29:34.166284 host-136-156-90-74 dhclient[995]: bound to 136.156.90.74 -- renewal in 34571 seconds. Oct 30 13:29:34.167466 host-136-156-90-74 network[866]: Determining IP information for eth0... done. Oct 30 13:29:34.224637 host-136-156-90-74 network[866]: [ OK ] Oct 30 13:29:34.245397 host-136-156-90-74 systemd[1]: Started LSB: Bring up/down networking. Oct 30 13:29:34.249783 host-136-156-90-74 systemd[1]: Reached target Network.

And cloud-init.service starts up, and this time it is able to connect to the IMDS:

2019-10-30 13:29:34,638 - init.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceOpenStack.DataSourceOpenStack'> 2019-10-30 13:29:34,638 - init.py[DEBUG]: Update datasource metadata and network config due to events: New instance first boot 2019-10-30 13:29:34,638 - util.py[DEBUG]: Running command ['systemd-detect-virt', '--quiet', '--container'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:29:34,641 - util.py[DEBUG]: Running command ['running-in-container'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:29:34,644 - util.py[DEBUG]: Running command ['lxc-is-container'] with allowed return codes [0] (shell=False, capture=True) 2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /proc/1/environ (quiet=False) 2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 0 bytes from /proc/1/environ 2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /proc/self/status (quiet=False) 2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 1206 bytes from /proc/self/status 2019-10-30 13:29:34,648 - util.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name 2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /sys/class/dmi/id/product_name (quiet=False) 2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 18 bytes from /sys/class/dmi/id/product_name 2019-10-30 13:29:34,648 - util.py[DEBUG]: dmi data /sys/class/dmi/id/product_name returned OpenStack Compute 2019-10-30 13:29:34,654 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 0.006 seconds 2019-10-30 13:29:34,654 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'headers': {'User-Agent': 'Cloud-Init/18.5'}, 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0} configuration 2019-10-30 13:29:36,359 - url_helper.py[DEBUG]: Read from http://169.254.169.254/openstack (200, 94b) after 1 attempts ... 2019-10-30 13:29:37,717 - util.py[DEBUG]: Crawl of openstack metadata service took 1.358 seconds 2019-10-30 13:29:37,718 - init.py[WARNING]: Error persisting instance-data.json: 'utf8' codec can't decode byte 0xd5 in position 1: invalid continuation byte 2019-10-30 13:29:37,718 - handlers.py[DEBUG]: finish: init-network/search-OpenStack: SUCCESS: found network data from DataSourceOpenStack 2019-10-30 13:29:37,718 - stages.py[INFO]: Loaded datasource DataSourceOpenStack - DataSourceOpenStack [net,ver=2]

So we know we're OpenStack, and as Scott pointed out, it suffers from the py27 json bug.

For this bug, the warning is not a False positive, rather, there seems to be an issue with connecting to the metadata service early in boot. I suspect we'd need some debugging enabled to see why we can't connect to the endpoint after a successful dhclient.

One interesting bit would be to compare the leases file from cloud-init Ephemeral DHCP, with the system dhcp. Another thought is if the image has any custom dhclient hooks that setup additional route information needed to hit the metadata service.

We've seen that bug before,

https://bugs.launchpad.net/cloud-init/+bug/1821102

Which is not in 18.5, so possibly a dupe of that issue.

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-10-30T15:19:09.636069+00:00

Attaching the lease file Launchpad attachments: dhclient--eth0.lease

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T15:35:53.067325+00:00

Un-duped again.

Ryan was on the right path... this is very close to bug 1821102. But the lease there has a "rfc3442-classless-static-routes" entry while Harry's has a 'classless-static-routes' entry. And they're unfortunately different formats. Fun.

Examples collected in bugs: rfc3442-classless-static-routes 32,169,254,169,254,130,56,248,255,0,130,56,240,1; option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254;

I guess possibly the centos 7 dhclient is just formatting the response differently.

The lease that Harry posted looks like below:

lease { interface "eth0"; fixed-address 136.156.90.65; option subnet-mask 255.255.254.0; option routers 136.156.91.254; option dhcp-lease-time 86400; option dhcp-message-type 5; option domain-name-servers 136.156.81.192; option dhcp-server-identifier 136.156.90.10; option interface-mtu 1500; option dhcp-renewal-time 43200; option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254; option broadcast-address 136.156.91.255; option dhcp-rebinding-time 75600; option host-name "host-136-156-90-65"; option domain-name "openstacklocal"; renew 4 2019/10/31 02:14:16; rebind 4 2019/10/31 12:15:38; expire 4 2019/10/31 15:15:38; }

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T15:38:58.517373+00:00

cloud-init version used was cloudcloud-init-18.5-3.el7.centos.x86_64

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T16:35:21.838376+00:00

The value:

option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254;

Doesn't look valid to me, the first period seems to be misplaced; it's either all comma separated, or space-separated list of IP/mask values.

Can we get the dhcp client's hook scripts? that's usually where the classless-static-route parsing is handled.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T16:49:17.150564+00:00

https://bugzilla.redhat.com/show_bug.cgi?id=1109949

This suggests that this is another format, internal to the dhclient.

Once the system is up, can you include the output from:

% ip -d route show

And any dhclient related config or log files?

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T17:02:34.345720+00:00

I think the format is:

NETMASK.DESTINATION GATEWAY

Comma is the record separator

0 is alias for default gateway.

-- With that, we'd have two routes:

ip route add 169.254.169.254/32 via 136.156.90.10 ip route add default via 136.156.91.254

But I'd really like to see some source code of dhcp client to confirm that this is the format being written.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-10-30T17:06:32.922337+00:00

@Harry, can you please post: $ rpm -q dhclient $ cat /usr/sbin/dhclient-script

then set the bug back to new.

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-10-30T17:15:35.139590+00:00

ip -d route show unicast default via 136.156.91.254 dev ens3 proto boot scope global unicast 136.156.90.0/23 dev ens3 proto kernel scope link src 136.156.90.31 unicast 169.254.169.254 via 136.156.90.11 dev ens3 proto static scope global

As for dhclient hooks, I see nothing in /etc/dhcp/. Is there some other location ?

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-10-30T17:19:46.695735+00:00

rpm -qa |grep dhclient dhclient-4.2.5-77.el7.centos.x86_64

I have also attached the dhclient hook. Launchpad attachments: dhclient_hook.sh

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T17:36:44.697278+00:00

Thanks Harry,

With the dhclient-script, I've found the parsing logic. Cloud-init will need to handle this format as well.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-10-30T17:40:58.465400+00:00

Looking at the hook script, I see the record separator is ', |'

if [ -n "${new_classless_static_routes}" ]; then IFS=', |'

And then, each entry is either 0 (default route) or, it's split on '.'

if [ ${target} = "0" ]; then new_routers="${static_routes[$i+1]}" continue else prefix=${target%%.} target=${target#.} IFS="." target_arr=(${target}) unset IFS ((pads=4-${#target_arr[@]})) for j in $(seq $pads); do target="${target}.0" done fi

ubuntu-server-builder commented 1 year ago

Launchpad user Eric Lafontaine(elafontaine) wrote on 2019-11-25T13:47:28.138705+00:00

Hi all, first time commenting, so please do critic :).

My opinion on the matter is that is we use dhclient to do the leasing, maybe we should use it's interface to get the DNS option instead. That way we would avoid having to maintain a mapping of different naming convention.

Now, I understand that dhclient doesn't expose such mapping and that it's decided at compile time; elafontaine@centos7# strings /sbin/dhclient | grep -i classless classless static routes option has wrong size or there's some garbage in format classless-static-routes

Maybe we could try to find a way to interface by having our own set of options asked to avoid that trouble?

so a dhclient.conf that would have an include statement on the standard dhclient.conf ?

Thoughts?

BTW, I fixed this issue on my side by having dhclient.conf just rename the classless-static-route option to rfc3442-classless-static-routes.

ubuntu-server-builder commented 1 year ago

Launchpad user Eric Lafontaine(elafontaine) wrote on 2019-11-26T22:26:26.110570+00:00

my bad, the rfc3442-class-static-routes was working due to another hook. just setting the option doesn't solve the issue.

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2019-12-18T17:55:33.470551+00:00

This bug is believed to be fixed in cloud-init in version 19.4. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-12-31T11:23:20.323165+00:00

Hi again. I have rebuild a centos image and integrated cloud-init 19.4+514.g4a4e26a4-1.el7 from https://copr.fedorainfracloud.org/coprs/g/cloud-init/el-testing/packages/

Again I believe the same issue is present. No metadata source is found in the network.

Logs are provided from both these bootups Launchpad attachments: Bootup with a networking source

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-12-31T11:24:28.600359+00:00

As a workaround I force the metadata source to come as a disk and that works as expected. I have attached a log to show how I would expect cloud init to work when network is the default metadata source Launchpad attachments: cloud-init.disksource.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-12-31T11:25:06.816480+00:00

Bug re-opened

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-12-31T11:32:27.500762+00:00

Attaching the leases file that I get when I force the metadata source to be the network.

Launchpad attachments: leases.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Harry Kominos(hkominos) wrote on 2019-12-31T12:13:25.803172+00:00

I am going to close this. The issue is appears to be resolved. The logs above do not represent accurate findings since the metadata service was not replying (for some reason).