canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Traceback when there is no OpenStack data source #5152

Open rjschwei opened 7 months ago

rjschwei commented 7 months ago

Bug report

The ds-identify script guesses that we may be in an OpenStack environment on non-x86_64 architectures, in this case aarch64 [1]. This then triggers the enablement of cloud-init services and as such the execution of the Python code. When no data source is found by the OpenStack data source implementation, an exception trickles to the top, causing a traceback.

2024-04-05 14:34:22,076 - util.py[DEBUG]: No active metadata service found
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/cloudinit/sources/DataSourceOpenStack.py", line 158, in _get_data
    results = util.log_time(
              ^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/cloudinit/util.py", line 2833, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/cloudinit/sources/DataSourceOpenStack.py", line 212, in _crawl_metadata
    raise sources.InvalidMetaDataException(
cloudinit.sources.InvalidMetaDataException: No active metadata service found

and

2024-04-05 14:34:22,107 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/cloudinit/cmd/main.py", line 385, in main_init
    init.fetch(existing=existing)
  File "/usr/lib/python3.11/site-packages/cloudinit/stages.py", line 466, in fetch
    return self._get_data_source(existing=existing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/cloudinit/stages.py", line 357, in _get_data_source
    (ds, dsname) = sources.find_source(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/cloudinit/sources/__init__.py", line 1032, in find_source
    raise DataSourceNotFoundException(msg)
cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: (DataSourceOpenStackLocal)

The exception should be handled and no traceback should be generated.

[1] https://github.com/canonical/cloud-init/blob/main/tools/ds-identify#L1370

Steps to reproduce the problem

Run a VM on aarch64 with cloud-init default config and no config drive.

Environment details

cloud-init logs

cloud-init.tar.gz

rjschwei commented 7 months ago

I think the issue occurs here: https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceOpenStack.py#L159. However, the call to log_time is inside a try-except block, and the except clause is supposed to handle InvalidMetaDataException, which _crawl_metadata raises with the message "No active metadata service found" that appears in the log. So I do not understand why we would still end up with the traceback.
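One possible explanation, sketched below with only the standard library: an exception can be caught and still leave a traceback in the log if the handler logs it with exc_info enabled. This is not cloud-init's code. The exception class, crawl_metadata, get_data, and run are hypothetical stand-ins, and whether the real except clause logs this way is an assumption to check against the source.

```python
import io
import logging


class InvalidMetaDataException(Exception):
    """Hypothetical stand-in for cloudinit.sources.InvalidMetaDataException."""


def crawl_metadata():
    # Hypothetical crawler that finds no metadata service.
    raise InvalidMetaDataException("No active metadata service found")


def get_data(log):
    try:
        crawl_metadata()
    except InvalidMetaDataException as e:
        # The exception is handled here and never propagates, but
        # exc_info=True still writes the full traceback to the log.
        log.debug(str(e), exc_info=True)
        return False
    return True


def run():
    # Capture log output in memory so it can be inspected.
    buf = io.StringIO()
    log = logging.getLogger("traceback_sketch")
    log.setLevel(logging.DEBUG)
    log.propagate = False
    handler = logging.StreamHandler(buf)
    log.addHandler(handler)
    try:
        found = get_data(log)
    finally:
        log.removeHandler(handler)
    return found, buf.getvalue()
```

Calling run() returns False together with log text that contains a "Traceback (most recent call last):" block, even though nothing was raised to the caller. If that is what happens here, the first traceback would be a logging choice rather than an unhandled error, while the second one (DataSourceNotFoundException) would still be a separate matter.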

holmanb commented 7 months ago

Ugh, what a mess. This should probably not be the default behavior.

> This then triggers the enablement of cloud-init services and as such the execution of the Python code.

Auto-enabling on all non-x86 is really not good. This bug is an example of why this permissive optimism was a bad default.

The only other users of DS_MAYBE in cloud-init, AltCloud and Ec2, only occur in much more limited environments: after positive match on a DMI value or when it is explicitly enabled by a configuration value, respectively.

The more that I think about this the more I think that it was a mistake to try to "just work" for openstack on other architectures without a positive signal.

I'm talking with some openstack folks in the meantime to try to get better openstack support for cloud-init on a few architectures, but I think we should consider making this non-default in an upcoming cloud-init release. Users that want to use cloud-init on non-x86 can always select openstack in cloud.cfg or on their kernel command line. Breaking users on some arches just so that other users on those same arches don't have to set a configuration value seems like a poor tradeoff, especially when the tradeoff is caused by a shortcoming of the cloud. Perhaps we should just try to fix openstack instead of depending on broken hacks like this. I think we can get openstack to pass DMI data on a few more arches than it already does, or alternatively it could probably even set the datasource on the kernel command line.
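To make the tradeoff concrete, here is a rough model of the two policies. It is a simplification for discussion, not a port of ds-identify: the Result enum, check_openstack, and all of its parameters are hypothetical, and treating an explicit selection as an immediate match is an assumption of the sketch.

```python
from enum import Enum


class Result(Enum):
    FOUND = "found"
    MAYBE = "maybe"
    NOT_FOUND = "not found"


def check_openstack(arch, dmi_match, explicitly_selected, permissive_default):
    """Hypothetical detection policy, not the real ds-identify logic."""
    # A positive signal (DMI data or explicit user configuration) wins.
    if dmi_match or explicitly_selected:
        return Result.FOUND
    # Permissive behavior: guess "maybe" on any non-x86_64 machine,
    # which enables cloud-init even where no OpenStack exists.
    if arch != "x86_64" and permissive_default:
        return Result.MAYBE
    return Result.NOT_FOUND


# Same aarch64 machine, no DMI data, nothing configured:
permissive = check_openstack("aarch64", False, False, permissive_default=True)
strict = check_openstack("aarch64", False, False, permissive_default=False)
# A user who opts in still gets OpenStack under the strict policy.
opted_in = check_openstack("aarch64", False, True, permissive_default=False)
```

Under the permissive policy the machine from this bug report gets MAYBE and cloud-init runs; under the strict one it gets NOT_FOUND unless the user opts in.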

holmanb commented 6 months ago

Related bug report:

https://github.com/canonical/cloud-init/issues/4106