canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init upgrade causes vultr init networking to fail. #5092

Open pnearing opened 8 months ago

pnearing commented 8 months ago

On upgrading an Ubuntu Mantic server on Vultr, I started getting an error on boot:

2024-03-23 15:02:27,710 - url_helper.py[WARNING]: Calling 'None' failed [119/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7aefa0738690>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]

And now this message appears on login:


This system is using the EC2 Metadata Service, but does not appear to
be running on Amazon EC2 or one of cloud-init's known platforms that
provide a EC2 Metadata service. In the future, cloud-init may stop
reading metadata from the EC2 Metadata Service unless the platform can be identified.

If you are seeing this message, please file a bug against
cloud-init at
https://github.com/canonical/cloud-init/issues
Make sure to include the cloud provider your instance is
running on.

For more information see
https://github.com/canonical/cloud-init/issues/2795

After you have filed a bug, you can disable this warning by
launching your instance with the cloud-config below, or
putting that content into
/etc/cloud/cloud.cfg.d/99-ec2-datasource.cfg

#cloud-config
datasource:
  Ec2:
    strict_id: false


Disable the warnings above by: touch /root/.cloud-warnings.skip or touch /var/lib/cloud/instance/warnings/.skip

If you require any more information, please let me know.

blackboxsw commented 8 months ago

Can you please run cloud-init collect-logs on the system and attach the resulting .tgz to this bug to give us a bit more information? Also, are there any other files in /etc/cloud besides cloud.cfg at play here?
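A quick sketch of that log-collection step, for anyone following along (cloud-init.tar.gz is the archive name the tool writes by default; adjust if yours differs):

```shell
# Run on the affected instance; needs root to read /var/log and /run/cloud-init.
sudo cloud-init collect-logs
# The archive lands in the current directory; attach it to the issue.
ls -lh cloud-init.tar.gz
```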

blackboxsw commented 8 months ago

Generally speaking, we'd expect the Vultr datasource to be discovered here if we are using the latest cloud-init on Mantic. So I'm presuming there is an issue earlier in the logs that led to Ec2 being detected instead of Vultr. The cloud-init collect-logs requested above will hopefully give us all the information we need to discover how this instance managed to not detect Vultr and fall back to Ec2. The logs of most interest here (which will be included in that tar file) are /run/cloud-init/ds-identify.log (initial datasource detection) and /var/log/cloud-init.log (which will potentially show us why Vultr wasn't detected). cloud-init status --format=json may also show you known errors quickly.
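As a rough sketch of those quick checks (standard cloud-init paths on Ubuntu; adjust if your layout differs):

```shell
# Initial datasource detection decisions made by ds-identify:
cat /run/cloud-init/ds-identify.log
# Why Vultr was or wasn't detected during boot:
grep -i vultr /var/log/cloud-init.log
# Machine-readable summary of any known errors:
cloud-init status --format=json
```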

blackboxsw commented 8 months ago

Yes, something else is going on here besides just the upgrade path. I launched a Vultr Mantic 23.10 instance with cloud-init 23.3.3, upgraded to the latest cloud-init 23.4.4, and rebooted with no issues in Vultr datasource detection.

I did recognize a known small bug dealing with a warning about scripts/vendor, for which a fix has already landed in https://github.com/canonical/cloud-init/pull/4986. But that issue would not have caused the Vultr datasource to go undiscovered.

root@test-mantic:~# cloud-init --version
/usr/bin/cloud-init 23.4.4-0ubuntu0~23.10.1
root@test-mantic:~# cloud-id
vultr
blackboxsw commented 8 months ago

CC: @eb3095 just FYI as I don't see a problem at the moment, but we'll wait on logs.

pnearing commented 8 months ago

Please find the logs attached. Also, I've not changed any config in /etc/cloud. cloud-init.tar.gz

blackboxsw commented 8 months ago

Thanks a lot for the logs @pnearing. As near as I can tell, something between 03/09 and the reboot after 03/23 altered the list of datasources that cloud-init tries to discover on this system from [ Vultr, None ] to the full list of all potential datasources:

2024-03-23 13:31:15,412 - __init__.py[DEBUG]: Looking for data source in: ['NoCloud', 'ConfigDrive', 'OpenNebula', 'DigitalOcean', 'Azure', 'AltCloud', 'OVF', 'MAAS', 'GCE', 'OpenStack', 'CloudSigma', 'SmartOS', 'Bigstep', 'Scaleway', 'AliYun', 'Ec2', 'CloudStack', 'Hetzner', 'IBMCloud', 'Oracle', 'Exoscale', 'RbxCloud', 'UpCloud', 'VMware', 'Vultr', 'LXD', 'NWCS', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']

Normally /usr/lib/cloud-init/ds-identify would filter this list of datasources down to only what could be viable, but there is configuration on this Vultr instance setting manual_cache_clean: true, which prevents ds-identify from filtering the list in the systemd generator timeframe. You can see that breadcrumb comment in /run/cloud-init/ds-identify.log: "manual_cache_clean enabled. Not writing datasource_list." This prevents ds-identify from writing out /run/cloud-init/cloud.cfg with a limited datasource_list: [ Vultr, None ] set of values. Therefore you now see in the latest cloud-init logs that cloud-init spends a long time trying to detect a whole bunch of inapplicable datasources, along with that lovely error banner telling you to file a bug.
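To make that breadcrumb easier to spot, here is a rough sketch of what to look at on the affected instance (file contents shown in comments are illustrative, not exact):

```shell
# ds-identify logs why it skipped writing the filtered list:
grep -i manual_cache_clean /run/cloud-init/ds-identify.log
# On a healthy Vultr instance this file would pin the list, e.g.
#   datasource_list: [ Vultr, None ]
# With manual_cache_clean enabled it is simply not written:
cat /run/cloud-init/cloud.cfg
```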

Normally I would expect to see /etc/cloud/cloud.cfg with a limited datasource_list: ['Vultr']. Did something change in /etc/cloud or /etc/cloud/cloud.cfg.d/*.cfg files to change this default setting? This is ultimately likely the problem we are seeing here.

Alternatively, the file /var/lib/cloud/instance/manual-clean may exist on this machine, or manual_cache_clean: true may have been set in configuration or user-data; when that manual-clean marker exists, ds-identify skips writing the filtered datasource_list.
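A small sketch of how to check whether either trigger is present on a given machine (both paths are the ones named above):

```shell
# Marker file written when manual cache cleaning is in effect:
ls -l /var/lib/cloud/instance/manual-clean
# Any configuration that sets manual_cache_clean or pins datasource_list:
grep -rn -e manual_cache_clean -e datasource_list /etc/cloud/
```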

TheRealFalcon commented 8 months ago

> Did something change in /etc/cloud or /etc/cloud/cloud.cfg.d/*.cfg files to change this default setting? This is ultimately likely the problem we are seeing here.

No config change needed. The Python version changed (presumably on upgrade), causing the cache to clear.

eb3095 commented 8 months ago

Yeah, this sounds like an extension of one of the issues I was dealing with in IRC. manual_cache_clean: true was added (I had mentioned we were doing this) because we were seeing issues where, if something broke with our host networking or DHCP failed, the server would re-init as NoCloud (or whatever the default was), cycling all the keys and the root user. That was a terrible UX for our users, and it caused the server to re-init yet again when networking came back. That was the only solution we were able to find to prevent this behavior, but we didn't make it a default option; we change it in our own provided cloud.cfg in our images.

I'd be happy to reopen this issue and find a more amicable solution so we don't need to do that.

blackboxsw commented 8 months ago

@eb3095 I did see the provided /etc/cloud/cloud.cfg in your images, which does limit datasource_list: [ Vultr, None ]. I think if you are generating images and packaging files delivered to /etc/cloud for configuration, you may want to write the datasource_list configuration to a file like /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg containing:

datasource_list: [ Vultr, None ]

The reason is that cloud-init upstream (and dpkg-reconfigure cloud-init) will write /etc/cloud/cloud.cfg.d/90_dpkg.cfg, which overrides the default datasource_list with the potentially long list of all datasources we see above, causing tracebacks and errors because the Ec2 datasource will get detected on Vultr platforms if Ec2 comes before Vultr in the datasource_list. Whatever /etc/cloud/cloud.cfg.d file you choose, it needs to sort lexicographically after 90_dpkg so it takes precedence.
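A minimal sketch of creating that drop-in, using the 95-ds-vultr.cfg name suggested above (any name that sorts after 90_dpkg.cfg would work):

```shell
# Run as root during image build; the higher prefix makes it override 90_dpkg.cfg.
cat > /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg << 'EOF'
datasource_list: [ Vultr, None ]
EOF
```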

eb3095 commented 8 months ago

Fantastic, I will get right on that. Thanks for the advice.