canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init generates a traceback if a default route already exists during ephemeral network setup #3595

Closed ubuntu-server-builder closed 1 month ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1860164

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = None
date_created = 2020-01-17T18:37:30.886100+00:00
date_fix_committed = None
date_fix_released = None
id = 1860164
importance = medium
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1860164
milestone = None
owner = rjschwei
owner_name = Robert Schweikert
private = False
status = triaged
submitter = rjschwei
submitter_name = Robert Schweikert
tags = []
duplicates = []

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T18:37:30.886100+00:00

If a route already exists when the ephemeral network is set up, cloud-init will generate the following traceback:

2020-01-16 21:14:22,584 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceOracle.DataSourceOracle'> failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 760, in find_source
    if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 649, in update_metadata
    result = self.get_data()
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/__init__.py", line 273, in get_data
    return_value = self._get_data()
  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceOracle.py", line 195, in _get_data
    with dhcp.EphemeralDHCPv4(net.find_fallback_nic()):
  File "/usr/lib/python2.7/site-packages/cloudinit/net/dhcp.py", line 57, in __enter__
    return self.obtain_lease()
  File "/usr/lib/python2.7/site-packages/cloudinit/net/dhcp.py", line 109, in obtain_lease
    ephipv4.__enter__()
  File "/usr/lib/python2.7/site-packages/cloudinit/net/__init__.py", line 920, in __enter__
    self._bringup_static_routes()
  File "/usr/lib/python2.7/site-packages/cloudinit/net/__init__.py", line 974, in _bringup_static_routes
    ['dev', self.interface], capture=True)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 2083, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.

This is a regression from 19.1 on SUSE, where existing routes were simply skipped.
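
The "skip a route that already exists" behaviour can be illustrated with a small check run before adding a route. This is only a minimal sketch, not cloud-init's code: `has_default_route` and `SAMPLE` are hypothetical names, and the function assumes text in the format printed by `ip route show`.

```python
def has_default_route(ip_route_output, interface=None):
    """Return True if `ip route show` style output contains a default route.

    If `interface` is given, only a default route on that device counts.
    """
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        if interface is None:
            return True
        # The device name follows the "dev" keyword when it is present.
        if "dev" in fields:
            idx = fields.index("dev")
            if idx + 1 < len(fields) and fields[idx + 1] == interface:
                return True
    return False


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "default via 10.0.0.1 dev ens3 proto dhcp metric 100\n"
    "10.0.0.0/24 dev ens3 proto kernel scope link src 10.0.0.5\n"
)
```

A caller could consult such a check and skip the route addition when it returns True, instead of letting the failing command raise.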

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T18:52:25.821757+00:00

https://github.com/canonical/cloud-init/pull/174

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-01-17T19:06:38.016273+00:00

Can you capture the output of cloud-init collect-logs? On Oracle, I suspect this is related to iSCSI root, where the initramfs already has networking up. In Ubuntu we collect the existing configuration from the initramfs as a network-config source, and we don't bring up ephemeral DHCP to crawl IMDS. I wonder whether that is missing on the SUSE path (that is, knowing whether networking is already up due to initramfs/iSCSI root)?

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T19:20:48.022010+00:00

Yes, this is in OCI.

I am not in a position to run cloud-init collect-logs as I am not able to get into a system with cloud-init 19.4 just yet.

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T20:15:55.087352+00:00

Whatever the code in net/cmdline does is certainly very distribution specific as we all decided collectively/separately to do things differently. It is not really a surprise that the detection that there is already a network from booting off iscsi is not working.

Thinking there should be a distribution independent way to figure out if we already have a network connection or not.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-01-17T20:39:38.708949+00:00

Unfortunately, each distro tends to have its own initramfs networking config format. As such, cloudinit/net/cmdline.py has implemented klibc parsing (which Ubuntu/Debian support), but dracut does something different, and I'm not sure what SUSE does here. Adding a parser for the initramfs format used would handle this.

https://github.com/canonical/cloud-init/blob/master/cloudinit/net/cmdline.py#L42

> Thinking there should be a distribution independent way to figure out if we already have a network connection or not.

There is, but we need more than "is networking up"; rather we need to translate the existing configuration and merge that with whatever else may come from IMDS; in Oracle the iscsiroot has a permanent dhcp config on a specific interface, however IMDS can provide network config for additional interfaces, so we must merge them. The OCI datasource already does this but distros need to provide an initramfs network config parser to extract the network config generated in the initramfs to something cloud-init can understand.
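
To make the merge step concrete, here is a minimal sketch of combining a config extracted from the initramfs with one obtained from IMDS. It is not the OCI datasource's implementation: `merge_net_cfg` is a hypothetical helper, the dicts loosely follow the network config v1 layout, and the rule that the initramfs entry wins for an interface present in both is an assumption made for illustration.

```python
def merge_net_cfg(initramfs_cfg, imds_cfg):
    """Merge two v1-style network configs keyed by interface name.

    Assumption: interfaces already configured in the initramfs keep that
    configuration; IMDS only contributes interfaces not seen there.
    """
    merged = {entry["name"]: entry for entry in imds_cfg["config"]}
    merged.update({entry["name"]: entry for entry in initramfs_cfg["config"]})
    return {
        "version": 1,
        "config": sorted(merged.values(), key=lambda entry: entry["name"]),
    }


# Hypothetical inputs: the iSCSI root interface comes from the initramfs,
# a secondary interface comes from IMDS.
initramfs_cfg = {
    "version": 1,
    "config": [
        {"type": "physical", "name": "ens3", "subnets": [{"type": "dhcp"}]},
    ],
}
imds_cfg = {
    "version": 1,
    "config": [
        {"type": "physical", "name": "ens3",
         "subnets": [{"type": "static", "address": "10.0.0.5/24"}]},
        {"type": "physical", "name": "ens5",
         "subnets": [{"type": "static", "address": "10.0.1.7/24"}]},
    ],
}
merged_cfg = merge_net_cfg(initramfs_cfg, imds_cfg)
```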

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T21:03:11.523115+00:00

Sorry for being dense, but by the time we get to the point where we decide whether or not to bring up an ephemeral network, we have long left the initrd, and since we are booting over iSCSI the network is up and configured. Any configuration information we might need can be extracted from the network via "ip" commands. Those are distro independent, thus a generic "translator" "live_config_to_net_cfg" would work everywhere. What am I missing?
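
A rough sketch of what such a translator might look like, assuming the JSON produced by `ip -json addr show` as input. The function body, the v1-style output layout, and the sample data are illustrative assumptions, not existing cloud-init code. Note that every address is emitted as static, because the live state does not record how an address was obtained.

```python
import json


def live_config_to_net_cfg(ip_addr_json):
    """Translate `ip -json addr show` output into a v1-style config dict."""
    config = []
    for link in json.loads(ip_addr_json):
        if link.get("link_type") == "loopback":
            continue
        # Keep only global IPv4 addresses in this sketch.
        subnets = [
            {"type": "static", "address": f"{addr['local']}/{addr['prefixlen']}"}
            for addr in link.get("addr_info", [])
            if addr.get("family") == "inet" and addr.get("scope") == "global"
        ]
        config.append({
            "type": "physical",
            "name": link["ifname"],
            "mac_address": link.get("address"),
            "subnets": subnets,
        })
    return {"version": 1, "config": config}


# Hypothetical sample input, for illustration only.
SAMPLE_JSON = json.dumps([
    {"ifname": "lo", "link_type": "loopback", "address": "00:00:00:00:00:00",
     "addr_info": [{"family": "inet", "local": "127.0.0.1",
                    "prefixlen": 8, "scope": "host"}]},
    {"ifname": "ens3", "link_type": "ether", "address": "02:00:17:00:aa:bb",
     "addr_info": [
         {"family": "inet", "local": "10.0.0.5",
          "prefixlen": 24, "scope": "global"},
         {"family": "inet6", "local": "fe80::17ff:fe00:aabb",
          "prefixlen": 64, "scope": "link"},
     ]},
])
```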

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2020-01-17T21:43:48+00:00

On Fri, Jan 17, 2020 at 15:15 Robert Schweikert 1860164@bugs.launchpad.net wrote:

> Sorry for being dense, by the time we get to the point where we decide whether or not to bring up an ephemeral network we have long left the initrd and since we are booting over iscsi the network is up and configured. Any configuration information we might need can be extracted from the network via "ip" commands. Those are distro independent thus a generic "translator" "live_config_to_net_cfg" would work everywhere. What am I missing?

The initrd supports more than just DHCP or static IP config, and ip commands won’t tell you which was used. There may be DNS or other options, so it’s best to parse the initramfs format, which anyhow parses the kernel command line to bring up networking in the initramfs.


ubuntu-server-builder commented 1 year ago

Launchpad user Robert Schweikert(rjschwei) wrote on 2020-01-17T22:27:38.745785+00:00

Removing myself as assignee, as I really have no idea right off the bat what we are after here, and I will most likely not have the time to dig into all the gory details.

Here is the doc for what dracut supports w.r.t. configuration in these types of situations:

http://man7.org/linux/man-pages/man7/dracut.cmdline.7.html

and in detail the way the network would be configured:

ip={dhcp|on|any|dhcp6|auto6|either6}

       dhcp|on|any
           get ip from dhcp server from all interfaces. If root=dhcp,
           loop sequentially through all interfaces (eth0, eth1, ...)
           and use the first with a valid DHCP root-path.

       auto6
           IPv6 autoconfiguration

       dhcp6
           IPv6 DHCP

       either6
           if auto6 fails, then dhcp6

   ip=<interface>:{dhcp|on|any|dhcp6|auto6}[:[<mtu>][:<macaddr>]]
       This parameter can be specified multiple times.

       dhcp|on|any|dhcp6
           get ip from dhcp server on a specific interface

       auto6
           do IPv6 autoconfiguration

       <macaddr>
           optionally set <macaddr> on the <interface>. This cannot be
           used in conjunction with the ifname argument for the same
           <interface>.

   ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<client_hostname>:<interface>:{none|off|dhcp|on|any|dhcp6|auto6|ibft}[:[<mtu>][:<macaddr>]]
       explicit network configuration. If you want to define an IPv6
       address, put it in brackets (e.g. [2001:DB8::1]). This parameter
       can be specified multiple times.  <peer> is optional and is the
       address of the remote endpoint for pointopoint interfaces and it
       may be followed by a slash and a decimal number, encoding the
       network prefix length.

       <macaddr>
           optionally set <macaddr> on the <interface>. This cannot be
           used in conjunction with the ifname argument for the same
           <interface>.

   ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<client_hostname>:<interface>:{none|off|dhcp|on|any|dhcp6|auto6|ibft}[:[<dns1>][:<dns2>]]
       explicit network configuration. If you want to define an IPv6
       address, put it in brackets (e.g. [2001:DB8::1]). This parameter
       can be specified multiple times.  <peer> is optional and is the
       address of the remote endpoint for pointopoint interfaces and it
       may be followed by a slash and a decimal number, encoding the
       network prefix length.

   ifname=<interface>:<MAC>
       Assign network device name <interface> (i.e. "bootnet") to the
       NIC with MAC <MAC>.

           Warning
           Do not use the default kernel naming scheme for the interface
           name, as it can conflict with the kernel names. So, don’t use
           "eth[0-9]+" for the interface name. Better name it "bootnet"
           or "bluesocket".

   rd.route=<net>/<netmask>:<gateway>[:<interface>]
       Add a static route with route options, which are separated by a
       colon. IPv6 addresses have to be put in brackets.

       Example.

               rd.route=192.168.200.0/24:192.168.100.222:ens10
               rd.route=192.168.200.0/24:192.168.100.222
               rd.route=192.168.200.0/24::ens10
               rd.route=[2001:DB8:3::/8]:[2001:DB8:2::1]:ens10

   bootdev=<interface>
       specify network interface to use routing and netroot information
       from. Required if multiple ip= lines are used.

   nameserver=<IP> [nameserver=<IP> ...]
       specify nameserver(s) to use

Then there are vlan, bond, bridge, and team kernel command line arguments one could use.
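
To give a feel for what a parser for this format involves, here is a minimal sketch that handles the `ip=` forms quoted above for IPv4 only. It is an illustration under stated assumptions, not code from cloud-init or dracut: `parse_dracut_ip` is a hypothetical name, bracketed IPv6 addresses are rejected, and the trailing optional fields of the explicit form are ignored.

```python
import shlex

AUTO_METHODS = {"dhcp", "on", "any", "dhcp6", "auto6", "either6"}


def parse_dracut_ip(cmdline):
    """Extract ip= arguments from a kernel command line string."""
    entries = []
    for token in shlex.split(cmdline):
        if not token.startswith("ip="):
            continue
        value = token[len("ip="):]
        if "[" in value:
            raise ValueError(f"bracketed IPv6 is not handled here: {value}")
        parts = value.split(":")
        if len(parts) == 1 and parts[0] in AUTO_METHODS:
            # ip={dhcp|on|any|dhcp6|auto6|either6}
            entries.append({"interface": None, "method": parts[0]})
        elif len(parts) >= 2 and parts[1] in AUTO_METHODS:
            # ip=<interface>:<method>[:[<mtu>][:<macaddr>]]
            entry = {"interface": parts[0], "method": parts[1]}
            if len(parts) > 2 and parts[2]:
                entry["mtu"] = int(parts[2])
            if len(parts) > 3:
                # A MAC address itself contains colons, so re-join it.
                entry["macaddr"] = ":".join(parts[3:])
            entries.append(entry)
        elif len(parts) >= 7:
            # ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:<method>
            client, peer, gateway, netmask, hostname, iface, method = parts[:7]
            entries.append({
                "interface": iface, "method": method, "address": client,
                "peer": peer or None, "gateway": gateway,
                "netmask": netmask, "hostname": hostname,
            })
        else:
            raise ValueError(f"unrecognised ip= value: {value}")
    return entries
```

For example, `parse_dracut_ip("root=/dev/sda1 ip=ens3:dhcp:1500")` yields a single entry for `ens3` with method `dhcp` and an MTU of 1500. The vlan, bond, bridge, and team arguments would each need their own handling.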

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2020-03-18T04:17:27.665439+00:00

[Expired for cloud-init because there has been no activity for 60 days.]

holmanb commented 1 month ago

> Thinking there should be a distribution independent way to figure out if we already have a network connection or not.

> There is, but we need more than "is networking up"; rather we need to translate the existing configuration and merge that with whatever else may come from IMDS; in Oracle the iscsiroot has a permanent dhcp config on a specific interface, however IMDS can provide network config for additional interfaces, so we must merge them. The OCI datasource already does this but distros need to provide an initramfs network config parser to extract the network config generated in the initramfs to something cloud-init can understand.

I don't think that this really makes sense. Sure, the initramfs may have some DHCP config that it got from the IMDS, but why would it be necessary to merge that into the datasource-provided network configuration? The issue is just a failure in ephemeral network setup; the failing code doesn't deal with network configuration. Why would you want this?

> Any configuration information we might need can be extracted from the network via "ip" commands.

I agree with @rjschwei here; this is far more cross-platform and, frankly, solves the problem at hand. I'm not sure how merging IMDS networking configuration with an initramfs DHCP config solves anything related to this issue.

TheRealFalcon commented 1 month ago

@holmanb, The connectivity url was added since this issue was active. I'm pretty sure it sidesteps this issue.

holmanb commented 1 month ago

> I'm pretty sure it sidesteps this issue.

@TheRealFalcon I think that you are right on the happy path, but I don't think that the URL check is a robust solution to this problem. It makes assumptions which might not be true. If the datasource isn't yet available, causing the connectivity check to fail, then this same issue will persist.

There are other (hypothetical) ways in which depending on the connectivity URL might be broken or cause undesirable behavior. Imagine a cloud where the image PXE-boots on one network, but after the initial DHCP/TFTP exchange a different network is used for IMDS (i.e. subsequent DHCP responses provide a different route with a longer prefix match to override the previous route). In this case, the pre-existing route would cause the connectivity check to fail after a 5 second timeout, but cloud-init would then proceed and otherwise behave correctly. This example might sound contrived, but a cloud provider should probably not want the instance to be able to access the PXE server from which it booted, for multiple security-related reasons.

Additionally, I have a hard time believing that a round trip to an http server would be faster than locally checking the network configuration, so there may be a performance win with removing the connectivity url check altogether once this codepath is more robust.
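
The "longer prefix match" in the PXE example can be illustrated with a small local lookup that needs no network round trip. This is a sketch under assumptions: `select_route`, the route labels, and the sample table are hypothetical, and 169.254.169.254 is used only as a commonly seen link-local metadata address.

```python
import ipaddress


def select_route(routes, destination):
    """Pick the route with the longest prefix that contains `destination`.

    `routes` is a list of (prefix, label) pairs; returns (prefix, label),
    or None if no route matches.
    """
    dest = ipaddress.ip_address(destination)
    matching = []
    for prefix, label in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matching.append((network, label))
    if not matching:
        return None
    network, label = max(matching, key=lambda item: item[0].prefixlen)
    return str(network), label


# Hypothetical routing table: the default route was installed during PXE
# boot, the more specific route arrived with a later DHCP response.
ROUTES = [
    ("0.0.0.0/0", "pxe-net"),
    ("169.254.0.0/16", "imds-net"),
]
```

With this table, a lookup for the metadata address selects the `imds-net` route, while any other destination falls back to the default route on `pxe-net`.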