canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

status is error when `cloud-init status --long` only shows warnings #5836

Open hector-gao opened 1 month ago

hector-gao commented 1 month ago

Bug report

The instance boots and the network is configured correctly, but `cloud-init status` reports `error`. In the status report, the `errors` field is empty. IIUC, the status should be `degraded done` when there are only warnings.

Steps to reproduce the problem

gcloud compute instances create cos113 --zone=us-west2-b --image=cos-113-18244-151-100 --image-project=cos-cloud

Environment details

cloud-init logs

......
2024-10-21 23:07:21,561 - DataSourceGCE.py[DEBUG]: Looking for the primary NIC in: ['eth0']
2024-10-21 23:07:21,563 - dhcp.py[DEBUG]: Skip dhclient configuration: No dhclient command found.
2024-10-21 23:07:21,563 - dhcp.py[WARNING]: DHCP client not found: dhclient
2024-10-21 23:07:21,564 - dhcp.py[WARNING]: DHCP client not found: dhcpcd
2024-10-21 23:07:21,564 - dhcp.py[DEBUG]: Skip udhcpc configuration: No udhcpc command found.
2024-10-21 23:07:21,564 - dhcp.py[WARNING]: DHCP client not found: udhcpc
2024-10-21 23:07:21,564 - DataSourceGCE.py[WARNING]: Did not find a fallback interface on gce.
2024-10-21 23:07:21,565 - handlers.py[DEBUG]: finish: init-local/search-GCELocal: FAIL: no local data found from DataSourceGCELocal
2024-10-21 23:07:21,565 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'> failed
2024-10-21 23:07:21,566 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'> failed
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/cloudinit/sources/__init__.py", line 1017, in find_source
    if s.update_metadata_if_supported(
  File "/usr/lib/python3.8/site-packages/cloudinit/sources/__init__.py", line 903, in update_metadata_if_supported
    result = self.get_data()
  File "/usr/lib/python3.8/site-packages/cloudinit/sources/__init__.py", line 438, in get_data
    return_value = self._check_and_get_data()
  File "/usr/lib/python3.8/site-packages/cloudinit/sources/__init__.py", line 370, in _check_and_get_data
    return self._get_data()
  File "/usr/lib/python3.8/site-packages/cloudinit/sources/DataSourceGCE.py", line 145, in _get_data
    if not ret["success"]:
UnboundLocalError: local variable 'ret' referenced before assignment
......
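The traceback ends in an `UnboundLocalError`, which Python raises when a local name is read on a code path where it was never assigned. The following is a minimal sketch of that failure pattern, not the actual `DataSourceGCE._get_data` code. `NoLeaseError`, `fetch_metadata`, `get_data_buggy`, and `get_data_guarded` are hypothetical names, and the sketch assumes every per-interface attempt fails before `ret` is bound.

```python
from typing import Dict, List, Optional


class NoLeaseError(Exception):
    """Hypothetical stand-in for a DHCP failure on one interface."""


def fetch_metadata(nic: str) -> Dict[str, bool]:
    # Hypothetical helper: pretend no DHCP client is installed, so every
    # attempt fails before any metadata can be read.
    raise NoLeaseError(f"no DHCP client available for {nic}")


def get_data_buggy(candidate_nics: List[str]) -> bool:
    for nic in candidate_nics:
        try:
            ret = fetch_metadata(nic)
        except NoLeaseError:
            continue  # 'ret' stays unbound when every attempt fails
        if ret["success"]:
            break
    # Raises UnboundLocalError if no attempt ever assigned 'ret'.
    return bool(ret["success"])


def get_data_guarded(candidate_nics: List[str]) -> bool:
    ret: Optional[Dict[str, bool]] = None  # bind the name before the loop
    for nic in candidate_nics:
        try:
            ret = fetch_metadata(nic)
        except NoLeaseError:
            continue
        if ret["success"]:
            break
    # A missing result is treated as "no data found" instead of crashing.
    return bool(ret and ret["success"])
```

With the guarded variant, the "no DHCP client" case would surface as an ordinary "no data found" result rather than an unhandled exception.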

output of `cloud-init status --long`:

status: error
extended_status: error
boot_status_code: enabled-by-generator
last_update: Mon, 21 Oct 2024 23:07:26 +0000
detail: DataSourceGCE
errors: []
recoverable_errors:
WARNING:

hector-gao commented 1 month ago

The error seems to be caused by the cloud-init services having the status "disabled", even though all the services ran successfully: https://github.com/canonical/cloud-init/blob/ubuntu/23.4.4-0ubuntu0_23.10.1/cloudinit/cmd/status.py#L341-L342
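A rough sketch of the kind of check being described, assuming the status command reads each unit's `UnitFileState` property from `systemctl show` output and treats anything other than an `enabled*` value or `static` as a failure. The function names and the sample property strings are illustrative and are not copied from `status.py`.

```python
from typing import Dict


def parse_systemctl_show(output: str) -> Dict[str, str]:
    """Parse 'Key=Value' lines as printed by `systemctl show --property=...`."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not Key=Value pairs
            props[key.strip()] = value.strip()
    return props


def unit_flagged_as_error(props: Dict[str, str]) -> bool:
    # Assumption: only states starting with "enabled", or exactly "static",
    # are accepted; everything else (e.g. "disabled") is reported as an error.
    state = props.get("UnitFileState", "")
    return not (state.startswith("enabled") or state == "static")


# Illustrative inputs: both units ran, only the unit file state differs.
sample_enabled = "ActiveState=active\nUnitFileState=enabled\nSubState=exited"
sample_disabled = "ActiveState=active\nUnitFileState=disabled\nSubState=exited"
```

Under this assumption, a unit that ran to completion is still flagged when its unit file state is `disabled`, which matches the behavior reported here.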

That leaves two follow-up questions:

  1. Why is the error not printed in the `errors` field of `cloud-init status --long`?
  2. Why is "enabled" or "static" required for the services? In COS, we use the Gentoo eclass function https://github.com/gentoo/gentoo/blob/master/eclass/systemd.eclass#L267 to enable the services. It creates a symlink in the `.wants` directory of the target. This makes sure the cloud-init services are executed, but they are not recognized as "enabled" by systemctl. Also, if a program or user disables cloud-init.service after boot, `cloud-init status` will print `error`, even though that does not necessarily indicate an actual error in the cloud-init process.
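To illustrate the `.wants` mechanism from the second question, here is a small sketch that lists which targets pull in a unit through a symlink in a `<target>.wants` directory. The helper names `targets_wanting` and `demo`, and the use of `cloud-init.target` in the throwaway tree, are assumptions made for the example. It does not attempt to reproduce what `systemctl is-enabled` reports.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def targets_wanting(unit_dir: Path, unit_name: str) -> List[str]:
    """Return targets whose '<target>.wants' directory links to unit_name."""
    wanted_by = []
    for wants_dir in sorted(unit_dir.glob("*.wants")):
        if (wants_dir / unit_name).is_symlink():
            # Strip the ".wants" suffix to recover the target name.
            wanted_by.append(wants_dir.name[: -len(".wants")])
    return wanted_by


def demo() -> List[str]:
    # Build a throwaway tree that mimics a systemd unit directory.
    with tempfile.TemporaryDirectory() as tmp:
        unit_dir = Path(tmp)
        (unit_dir / "cloud-init.service").write_text("[Unit]\n")
        wants = unit_dir / "cloud-init.target.wants"
        wants.mkdir()
        os.symlink("../cloud-init.service", wants / "cloud-init.service")
        return targets_wanting(unit_dir, "cloud-init.service")
```

A check along these lines would answer "is the unit pulled in by some target" independently of the unit file state that systemctl reports.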