canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

SmartOS detection broken after 23.1.2 #4299

Closed bahamat closed 1 year ago

bahamat commented 1 year ago

Bug report

Using ubuntu-22.04:

Steps to reproduce the problem

  1. Boot an ubuntu 22 image with cloud-init 23.1.2-0ubuntu0~22.04.1 (e.g., image https://smartos.org/images/#/view/433d1deb-1f17-440c-baec-60fa4034a36e). Note the source is detected properly.
  2. Boot an ubuntu 22 image with cloud-init 23.2.1-0ubuntu0~22.04.1 (no published images due to this bug), or upgrade cloud-init on a previously booted instance and then reset it. The source is no longer detected as SmartOS.

Environment details

cloud-init logs

Correct behavior:

2023-07-26 17:10:37,330 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2023-07-26 17:10:37,330 - __init__.py[DEBUG]: Looking for data source in: ['SmartOS', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']
2023-07-26 17:10:37,621 - __init__.py[DEBUG]: Searching for local data source in: ['DataSourceSmartOS']
2023-07-26 17:10:37,622 - handlers.py[DEBUG]: start: init-local/search-SmartOS: searching for local data from DataSourceSmartOS
2023-07-26 17:10:37,622 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceSmartOS.DataSourceSmartOS'>
2023-07-26 17:10:37,623 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name
2023-07-26 17:10:37,623 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2023-07-26 17:10:37,623 - __init__.py[DEBUG]: Machine is running on DataSourceSmartOS [client=JoyentMetadataLegacySerialClient(device=/dev/ttyS1, timeout=60)].

Broken behavior:

2023-07-26 16:47:35,218 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2023-07-26 16:47:35,218 - __init__.py[DEBUG]: Looking for data source in: ['nocloud-net', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']
2023-07-26 16:47:35,251 - __init__.py[DEBUG]: Searching for local data source in: ['DataSourceNoCloud']
2023-07-26 16:47:35,251 - handlers.py[DEBUG]: start: init-local/search-NoCloud: searching for local data from DataSourceNoCloud
2023-07-26 16:47:35,251 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloud'>
2023-07-26 16:47:35,251 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2023-07-26 16:47:35,260 - __init__.py[DEBUG]: Detected platform: DataSourceNoCloud [seed=None][dsmode=net]. Checking for active instance data
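
To compare the two boots quickly, the datasource search list can be pulled out of the "Looking for data source in" line. This is only a minimal sketch: the function name `datasource_search_list` is hypothetical, and it assumes the log line keeps the format shown in the snippets above.

```python
import ast
import re

# Matches the "Looking for data source in: [...]" debug line from the logs above.
SEARCH_LINE = re.compile(r"Looking for data source in: (\[[^\]]*\])")


def datasource_search_list(log_text):
    """Return the datasource list from the first matching log line, or None."""
    match = SEARCH_LINE.search(log_text)
    if match is None:
        return None
    # The list is logged as a Python literal, so literal_eval can read it safely.
    return ast.literal_eval(match.group(1))


good_line = (
    "2023-07-26 17:10:37,330 - __init__.py[DEBUG]: Looking for data source in: "
    "['SmartOS', 'None'], via packages ['', 'cloudinit.sources'] "
    "that matches dependencies ['FILESYSTEM']"
)
broken_line = (
    "2023-07-26 16:47:35,218 - __init__.py[DEBUG]: Looking for data source in: "
    "['nocloud-net', 'None'], via packages ['', 'cloudinit.sources'] "
    "that matches dependencies ['FILESYSTEM']"
)

print(datasource_search_list(good_line))
print(datasource_search_list(broken_line))
```

On the two lines above, the first call yields the SmartOS list and the second yields the nocloud-net list, which is the difference this report is about.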

TheRealFalcon commented 1 year ago

Thanks for the bug report, @bahamat. Unfortunately, these log snippets don't provide the full picture. Can you provide the contents of /run/cloud-init/ds-identify.log on an affected instance?

cc: @igalic

bahamat commented 1 year ago

@TheRealFalcon Here you go:

[up 13.20s] ds-identify
policy loaded: mode=search report=false found=all maybe=all notfound=disabled
DMI_PRODUCT_NAME=SmartDC HVM
DMI_SYS_VENDOR=Joyent
DMI_PRODUCT_SERIAL=1f7da0ba-c3cd-43ac-8f82-9cea1064d6fc
DMI_PRODUCT_UUID=1f7da0ba-c3cd-43ac-8f82-9cea1064d6fc
PID_1_PRODUCT_NAME=unavailable
DMI_CHASSIS_ASSET_TAG=
DMI_BOARD_NAME=unavailable
FS_LABELS=
ISO9660_DEVS=
KERNEL_CMDLINE=BOOT_IMAGE=/vmlinuz-5.19.0-50-generic root=UUID=060772a6-6d81-4983-ba0d-089aabb86937 ro autoinstall console=tty0 console=ttyS0,115200n8 ds=nocloud-net
VIRT=kvm
UNAME_KERNEL_NAME=Linux
UNAME_KERNEL_RELEASE=5.19.0-50-generic
UNAME_KERNEL_VERSION=#50-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 18:24:29 UTC 2023
UNAME_MACHINE=x86_64
UNAME_NODENAME=localhost
UNAME_OPERATING_SYSTEM=GNU/Linux
DSNAME=nocloud-net
DSLIST=nocloud-net
MODE=search
ON_FOUND=all
ON_MAYBE=all
ON_NOTFOUND=disabled
pid=317 ppid=297
is_container=false
datasource 'nocloud-net' specified.
[up 13.59s] returning 0
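
The relevant detail in this output is the `ds=nocloud-net` token at the end of KERNEL_CMDLINE, which matches DSNAME and DSLIST. The sketch below shows one way to pull that out of a ds-identify.log dump. It is not ds-identify's own code: the helper names are hypothetical, and it assumes the first `ds=` token is the one that matters and that anything after a `;` is extra datasource configuration.

```python
import re

# Upper-case KEY=value lines, as printed in the ds-identify.log above.
KEY_VALUE = re.compile(r"^([A-Z][A-Z0-9_]*)=(.*)$")


def parse_ds_identify(log_text):
    """Collect the KEY=value lines of a ds-identify.log into a dict."""
    values = {}
    for line in log_text.splitlines():
        match = KEY_VALUE.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    return values


def cmdline_datasource(cmdline):
    """Return the datasource named by the first ds= token, or None.

    Simplification (an assumption, not ds-identify's logic): only the
    plain "ds=" spelling is checked, and text after ';' is dropped.
    """
    for token in cmdline.split():
        if token.startswith("ds="):
            return token[len("ds="):].split(";", 1)[0]
    return None


sample = """\
policy loaded: mode=search report=false found=all maybe=all notfound=disabled
DMI_PRODUCT_NAME=SmartDC HVM
KERNEL_CMDLINE=BOOT_IMAGE=/vmlinuz-5.19.0-50-generic root=UUID=060772a6-6d81-4983-ba0d-089aabb86937 ro autoinstall console=tty0 console=ttyS0,115200n8 ds=nocloud-net
DSNAME=nocloud-net
pid=317 ppid=297
"""

values = parse_ds_identify(sample)
print(values["DMI_PRODUCT_NAME"])
print(cmdline_datasource(values["KERNEL_CMDLINE"]))
```

Here the DMI product name still says SmartDC HVM, while the kernel command line names nocloud-net, so the two sources of information disagree.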

bahamat commented 1 year ago

Something else I discovered that may be relevant is that we have /etc/cloud/cloud.cfg.d/90_dpkg.cfg (which comes from the package manager) and /etc/cloud/cloud.cfg.d/90_smartos.cfg (which comes from our build system). The contents of these files are as follows:

90_dpkg.cfg:

# to update this file, run dpkg-reconfigure cloud-init
datasource_list: [ NoCloud, ConfigDrive, OpenNebula, DigitalOcean, Azure, AltCloud, OVF, MAAS, GCE, OpenStack, CloudSigma, SmartOS, Bigstep, Scaleway, AliYun, CloudStack, Hetzner, IBMCloud, Oracle, Exoscale, RbxCloud, UpCloud, VMware, Vultr, LXD, NWCS, None ]

90_smartos.cfg:

datasource_list: [ SmartOS ]

# Preserve traditional root@<ip> login that was possible with rc.local
disable_root: false

# Do not create the default user
users: [ ]

mounts:
- [ vdb, /data, auto, "defaults,nofail" ]

I may be able to work around this by preseeding debconf with only SmartOS. Even so, a change in behavior when the configuration didn't change is still an issue.
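
For reference, here is a rough sketch of how the two files above would combine. It assumes that fragments in cloud.cfg.d are applied in sorted file-name order and that a later fragment replaces a top-level key such as datasource_list outright. That is a simplified model, not cloud-init's actual merge code, and the function name is hypothetical.

```python
def effective_datasource_list(fragments):
    """Merge config fragments by sorted file name; later keys replace earlier ones.

    `fragments` maps a file name to its already-parsed config dict.
    Assumed model only: real merging rules may be more involved.
    """
    merged = {}
    for name in sorted(fragments):
        merged.update(fragments[name])
    return merged.get("datasource_list")


fragments = {
    # Shortened stand-in for the long list shipped in 90_dpkg.cfg.
    "90_dpkg.cfg": {"datasource_list": ["NoCloud", "ConfigDrive", "SmartOS", "None"]},
    "90_smartos.cfg": {
        "datasource_list": ["SmartOS"],
        "disable_root": False,
        "users": [],
    },
}

print(effective_datasource_list(fragments))
```

Under that assumption 90_smartos.cfg sorts after 90_dpkg.cfg, so the SmartOS-only list would win. That is consistent with the working log above, but it does not account for a ds= argument on the kernel command line.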

TheRealFalcon commented 1 year ago

Why is ds=nocloud-net in your kernel command line? I get that this behavior has changed, but it's arguable that this change is purely a bug fix and your current configuration would be incorrect if you're not expecting a NoCloud datasource.

bahamat commented 1 year ago

That's left over from the image creation process (because of the way it's done it doesn't provide the same metadata facilities as a normal instance), and I didn't realize that was there until I started debugging this issue.

I have a build running now that includes both the dpkg and grub config fixes to see if it helps.

TheRealFalcon commented 1 year ago

@bahamat, any updates here? Were you able to fix your issue by changing configuration?

bahamat commented 1 year ago

@TheRealFalcon I did actually. We had a release yesterday so that took up some of my time over the past few days, but I did manage to get this tested as well.

So what I did with the new image is preseed debconf to include only SmartOS in the 90_dpkg.cfg file, and remove ds=nocloud-net from the kernel command line. Between the two of those it's behaving correctly again.

I'm still a bit curious as to exactly why the behavior changed, but I can plausibly accept that cloud-init now simply obeys the kernel cmdline correctly, and that a bug fix on your side revealed a bug on my side.

With that, I think I'm satisfied with the current state. Thanks for helping get this sorted out.

TheRealFalcon commented 1 year ago

> I'm still a bit curious as to exactly why the behavior changed

There was an overhaul in how we handle the kernel command line parameters, and as part of that we made sure NoCloud is detected when it is specified there.

Given your last comment, I'm going to close this. Feel free to open a new issue if anything else pops up.