canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init fails to add default route during _bringup_static_routes #3643

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1871323

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2020-06-15T04:17:15.715226+00:00
date_created = 2020-04-07T08:20:20.669684+00:00
date_fix_committed = None
date_fix_released = None
id = 1871323
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1871323
milestone = None
owner = eru1729
owner_name = Ilwoo Park
private = False
status = expired
submitter = eru1729
submitter_name = Ilwoo Park
tags = []
duplicates = []

Launchpad user Ilwoo Park(eru1729) wrote on 2020-04-07T08:20:20.669684+00:00

Cloud Provider: OpenStack (Stein)
Distro: Ubuntu 16.04
Cloud-init version: 19.4-33-gbb4131a2-0ubuntu1~16.04.1

Problem:

Since cloud-init introduced support for classless static routes, it fails to add a route to the gateway in our environment.

Looking through the code, I believe the code at the following location should be patched.

https://github.com/canonical/cloud-init/blob/master/cloudinit/net/__init__.py#L1113

Can someone verify the issue and comment on the suggested fix?

Here's a sample cloud-init log with the DEBUG flag set.

...
2020-04-07 02:51:55,949 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '10.54.62.43/32', 'broadcast', '10.54.62.127', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2020-04-07 02:51:55,951 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2020-04-07 02:51:55,954 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '10.54.62.1/32', 'via', '0.0.0.0', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2020-04-07 02:51:55,956 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '169.254.169.254/32', 'via', '10.54.62.1', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2020-04-07 02:51:55,959 - handlers.py[DEBUG]: finish: init-local/search-OpenStackLocal: FAIL: no local data found from DataSourceOpenStackLocal
2020-04-07 02:51:55,959 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceOpenStack.DataSourceOpenStackLocal'> failed
2020-04-07 02:51:55,959 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceOpenStack.DataSourceOpenStackLocal'> failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 760, in find_source
    if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 649, in update_metadata
    result = self.get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 273, in get_data
    return_value = self._get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 130, in _get_data
    with EphemeralDHCPv4(self.fallback_interface):
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 57, in __enter__
    return self.obtain_lease()
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 109, in obtain_lease
    ephipv4.__enter__()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 986, in __enter__
    self._bringup_static_routes()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 1040, in _bringup_static_routes
    ['dev', self.interface], capture=True)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2102, in subp
    cmd=args)
cloudinit.util.ProcessExecutionError: Unexpected error while running command.
Command: ['ip', '-4', 'route', 'add', '169.254.169.254/32', 'via', '10.54.62.1', 'dev', 'eth0']
Exit code: 2
Reason: -
Stdout:
Stderr: RTNETLINK answers: Network is unreachable
...

A sample lease file and the interface address setup are as follows.

cat /var/lib/dhcp/eth0.lease

lease {
  interface "eth0";
  fixed-address 10.54.62.43;
  option subnet-mask 255.255.255.255;
  option routers 10.54.62.1;
  option dhcp-lease-time 4294967295;
  option dhcp-message-type 5;
  option domain-name-servers 10.20.30.40;
  option dhcp-server-identifier 10.54.62.1;
  option interface-mtu 1500;
  option rfc3442-classless-static-routes 32,10,54,62,1,0,0,0,0,32,169,254,169,254,10,54,62,1,0,10,54,62,1;
  option broadcast-address 10.54.62.127;
  option host-name "host-10-54-62-43";
  option domain-name "local";
  renew 0 2088/04/25 06:42:22;
  rebind 0 2139/05/10 15:07:51;
  expire 5 2156/05/14 09:56:29;
}

ifconfig eth0

eth0      Link encap:Ethernet  HWaddr ab:cd:ef:a1:50:a8
          inet addr:10.54.62.43  Bcast:10.54.62.127  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12748 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12123 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:50000
          RX bytes:1757625 (1.7 MB)  TX bytes:1262391 (1.2 MB)

route -n

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.54.62.1      0.0.0.0         UG    0      0        0 eth0
10.54.62.1      0.0.0.0         255.255.255.255 UH    0      0        0 eth0
169.254.169.254 10.54.62.1      255.255.255.255 UGH   0      0        0 eth0
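
For readers following along, the rfc3442-classless-static-routes value in the lease above can be decoded by hand or with a few lines of Python. The sketch below is only an illustration of the RFC 3442 encoding (prefix length, then the significant destination octets, then a four-octet router); the function name parse_classless_routes is hypothetical and this is not cloud-init's own parser.

```python
from ipaddress import IPv4Address
from typing import List, Tuple


def parse_classless_routes(option: str) -> List[Tuple[str, str]]:
    """Decode a comma-separated RFC 3442 option into (destination/prefix, router) pairs.

    Hypothetical helper for illustration; not part of cloud-init.
    """
    octets = [int(tok) for tok in option.strip().rstrip(";").split(",")]
    routes = []
    i = 0
    while i < len(octets):
        prefix = octets[i]
        if not 0 <= prefix <= 32:
            raise ValueError(f"invalid prefix length {prefix}")
        # Only the octets covered by the prefix are transmitted.
        significant = (prefix + 7) // 8
        end = i + 1 + significant + 4
        if end > len(octets):
            raise ValueError("truncated classless static route option")
        dest = octets[i + 1:i + 1 + significant] + [0] * (4 - significant)
        router = octets[i + 1 + significant:end]
        routes.append(
            (f"{IPv4Address(bytes(dest))}/{prefix}", str(IPv4Address(bytes(router))))
        )
        i = end
    return routes


lease_value = "32,10,54,62,1,0,0,0,0,32,169,254,169,254,10,54,62,1,0,10,54,62,1"
for destination, router in parse_classless_routes(lease_value):
    print(destination, "via", router)
```

Decoded this way, the lease carries three routes: 10.54.62.1/32 with router 0.0.0.0, 169.254.169.254/32 via 10.54.62.1, and the default route 0.0.0.0/0 via 10.54.62.1. The first two match the 'ip route add' commands in the log.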

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-04-08T21:07:29.076611+00:00

Thanks for reporting the bug.

Would you be able to run 'cloud-init collect-logs' and attach the tarball? If not, providing /var/log/cloud-init.log would be useful in debugging the issue.

Thanks!

ubuntu-server-builder commented 1 year ago

Launchpad user Ilwoo Park(eru1729) wrote on 2020-04-09T09:18:11.335971+00:00

Thanks for the quick response.

I've attached the cloud-init.log from the affected server. Launchpad attachments: cloud-init.log

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2020-04-14T20:34:10.159086+00:00

From the logs attached it looks to me like either cloud-init is not parsing the returned dhcp lease properly during EphemeralDHCP setup, or the dhcp server on this network is sending out bogus values. It's strange to me to see cloud-init claiming it's setting up a subnet with an inaccessible broadcast addr in your logs.

" Attempting setup of ephemeral network on eth0 with 10.54.62.43/32 brd 10.54.62.127"

From your bug I'm confused why the lease is saying the subnet for the dhcp addr is 255.255.255.255 and the router is at 10.54.62.1. Doesn't that CIDR 10.54.62.43/255.255.255.255 mean that the address has a subnet that is only 1 IP address wide, so it has no visibility to the router?

fixed-address 10.54.62.43; option subnet-mask 255.255.255.255; option routers 10.54.62.1;

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2020-04-14T20:35:13.369459+00:00

From the logs attached it looks to me like cloud-init is setting up an invalid network.

" Attempting setup of ephemeral network on eth0 with 10.54.62.43/32 brd 10.54.62.127"

From your bug I'm confused why the lease is saying the subnet for the dhcp addr is 255.255.255.255 and the router is at 10.54.62.1. Doesn't the CIDR 10.54.62.43/255.255.255.255 mean that the address has a subnet that is only 1 IP address wide, so it has no visibility to the router?

fixed-address 10.54.62.43; option subnet-mask 255.255.255.255; option routers 10.54.62.1;

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2020-04-14T20:58:17.541392+00:00

Given that the router lives at 10.54.62.1, it seems likely that the most specific netmask this IP can have is /26, or 255.255.255.192, in order to still see the router IP on 10.54.62.43's own subnet.

I may be misunderstanding something here though.
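
The /26 reasoning above can be checked with Python's standard ipaddress module. This is a small sketch for verification only; the helper name longest_common_prefix is made up for the example.

```python
from ipaddress import IPv4Address, IPv4Interface


def longest_common_prefix(a: str, b: str) -> int:
    """Return the longest prefix length whose network contains both addresses."""
    diff = int(IPv4Address(a)) ^ int(IPv4Address(b))
    return 32 - diff.bit_length()


# Address and mask exactly as they appear in the lease.
iface = IPv4Interface("10.54.62.43/255.255.255.255")
router = IPv4Address("10.54.62.1")

# With a /32 mask the router is not inside the instance's own network.
on_link = router in iface.network

# Most specific prefix that would put both addresses in one subnet.
prefix = longest_common_prefix("10.54.62.43", "10.54.62.1")
widest = IPv4Interface(f"10.54.62.43/{prefix}").network

print(on_link, prefix, widest, widest.netmask)
```

With the lease values, the router is not on the /32 network, and the most specific common subnet is 10.54.62.0/26 (netmask 255.255.255.192).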

ubuntu-server-builder commented 1 year ago

Launchpad user Ilwoo Park(eru1729) wrote on 2020-04-15T06:55:36.684100+00:00

Hi,

You're right about the subnet part. If the router IP is inside the instance's network range, the current implementation of cloud-init works fine.

It seems I should clarify how we set up connectivity for each instance.

We configure each instance to send any packet not addressed to itself to the router IP. The hypervisor captures packets sent to the router IP, then forwards them using a routing protocol.

To achieve this, we assign a /32 prefix to each instance and set up a local-scope route to the router IP, so the router IP cannot fall inside the instance's own network range.

The current implementation of cloud-init does not set up a local-scope routing entry to the router, and this breaks our instances' network configuration.

Let me give you an example with instance IP ("10.254.0.2/32") and router IP ("10.254.0.1"), comparing the desired result with the current cloud-init behavior.

[What we expect]

"_bringup_device()"

ip link add name dum0 type dummy

ip -family inet addr add 10.254.0.2/32 broadcast 10.254.0.255 dev dum0

ip link set dev dum0 up

"_bringup_static_routes()"

ip route add 10.254.0.1/32 dev dum0 (this adds local scope route to the router ip)

ip route add 169.254.169.254/32 via 10.254.0.1 dev dum0

ip route add 0.0.0.0/0 via 10.254.0.1 dev dum0

[Cloud-init behavior]

"_bringup_device()"

ip link add name dum0 type dummy

ip -family inet addr add 10.254.0.2/32 broadcast 10.254.0.255 dev dum0

ip link set dev dum0 up

"_bringup_static_routes()"

ip route add 10.254.0.1/32 via 0.0.0.0 dev dum0 (this adds global scope route to the router ip)

ip route add 169.254.169.254/32 via 10.254.0.1 dev dum0 (this fails as nexthop gateway is unreachable)
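
The difference between the two command sequences above comes down to how a router of 0.0.0.0 is handled. The Python sketch below builds the 'ip' argument lists the way the reporter expects: when the router is 0.0.0.0 the 'via' clause is omitted so the destination becomes an on-link route. The function name route_commands is hypothetical, and this is neither cloud-init's code nor a fix accepted in this thread.

```python
from typing import Iterable, List, Tuple


def route_commands(routes: Iterable[Tuple[str, str]], interface: str) -> List[List[str]]:
    """Build 'ip route add' argument lists, treating router 0.0.0.0 as on-link.

    Hypothetical helper illustrating the behavior requested in this report.
    """
    commands = []
    for destination, router in routes:
        cmd = ["ip", "-4", "route", "add", destination]
        if router != "0.0.0.0":
            # Only add a next hop when a real router address is given.
            cmd += ["via", router]
        cmd += ["dev", interface]
        commands.append(cmd)
    return commands


example_routes = [
    ("10.254.0.1/32", "0.0.0.0"),
    ("169.254.169.254/32", "10.254.0.1"),
    ("0.0.0.0/0", "10.254.0.1"),
]
for cmd in route_commands(example_routes, "dum0"):
    print(" ".join(cmd))
```

The sketch only constructs the commands and does not run them; the routes are kept in the order given, so the on-link route to the router is added before the routes that use it as a next hop.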

I hope this comment clarifies our findings.

If you need more information, please let me know.

Regards, Ilwoo

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2020-06-15T04:17:15.129760+00:00

[Expired for cloud-init because there has been no activity for 60 days.]