canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

DataSourceEC2: Doesn't support the 2009-04-04 version of the EC2 metadata properly #5711

Open NeilW opened 1 month ago

NeilW commented 1 month ago

Bug report

DataSourceEC2 supports a minimal EC2 metadata version of 2009-04-04

https://github.com/canonical/cloud-init/blob/654cb4414b29ab845e0fdad97b5beca8721844df/cloudinit/sources/DataSourceEc2.py#L79

but issues a warning because the network key is missing. That version of the metadata has no network key (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html).

This warning causes cloud-init status to exit with an exit code of 2, which causes many boot scripts to fail.

There is no test for a 2009-04-04 version of the metadata in the cloud-init data source test scripts.

Steps to reproduce the problem

The top-level keys for the metadata can be obtained from any EC2 machine

$ curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/2009-04-04/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
instance-action
instance-id
instance-type
local-hostname
local-ipv4
placement/
profile
public-hostname
public-ipv4
public-keys/
reservation-id
security-groups
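
The listing above can also be checked mechanically. The following is a minimal sketch, assuming the curl output has been captured as text; `top_level_keys` is a hypothetical helper for illustration and is not part of cloud-init.

```python
# Top-level keys returned for the 2009-04-04 metadata version (copied from the listing above).
LISTING_2009_04_04 = """\
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
instance-action
instance-id
instance-type
local-hostname
local-ipv4
placement/
profile
public-hostname
public-ipv4
public-keys/
reservation-id
security-groups
"""


def top_level_keys(listing: str) -> set[str]:
    """Return the key names, dropping the trailing '/' that marks a directory."""
    return {line.strip().rstrip("/") for line in listing.splitlines() if line.strip()}


keys = top_level_keys(LISTING_2009_04_04)
print("network" in keys)  # whether this metadata version exposes a 'network' key
```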

Environment details

cloud-init logs

2024-09-17 10:28:22,642 - sources[DEBUG]: Searching for local data source in: ['DataSourceEc2Local']
2024-09-17 10:28:22,642 - handlers.py[DEBUG]: start: init-local/search-Ec2Local: searching for local data from DataSourceEc2Local
2024-09-17 10:28:22,642 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'>
2024-09-17 10:28:22,644 - DataSourceEc2.py[DEBUG]: Local Ec2 mode only supported on ('aws', 'outscale'), not brightbox
2024-09-17 10:28:22,644 - sources[DEBUG]: Datasource DataSourceEc2Local not updated for events: boot-new-instance
2024-09-17 10:28:22,644 - handlers.py[DEBUG]: finish: init-local/search-Ec2Local: SUCCESS: no local data found from DataSourceEc2Local
2024-09-17 10:28:25,687 - sources[DEBUG]: Searching for network data source in: ['DataSourceEc2', 'DataSourceNone']
2024-09-17 10:28:25,687 - handlers.py[DEBUG]: start: init-network/search-Ec2: searching for network data from DataSourceEc2
2024-09-17 10:28:25,687 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2'>
2024-09-17 10:28:25,688 - sources[DEBUG]: Detected platform: DataSourceEc2. Checking for active instance data
2024-09-17 10:28:25,689 - DataSourceEc2.py[DEBUG]: strict_mode: warn, cloud_name=brightbox cloud_platform=ec2
2024-09-17 10:28:25,813 - DataSourceEc2.py[DEBUG]: Removed the following from metadata urls: ['http://instance-data.:8773']
2024-09-17 10:28:25,868 - DataSourceEc2.py[DEBUG]: Using metadata source: 'http://169.254.169.254'
2024-09-17 10:28:25,877 - DataSourceEc2.py[DEBUG]: url http://169.254.169.254/2021-03-23/meta-data/instance-id raised exception 404 Client Error: Not Found for url: http://169.254.169.254/2021-03-23/meta-data/instance-id
2024-09-17 10:28:25,885 - DataSourceEc2.py[DEBUG]: url http://169.254.169.254/2018-09-24/meta-data/instance-id raised exception 404 Client Error: Not Found for url: http://169.254.169.254/2018-09-24/meta-data/instance-id
2024-09-17 10:28:25,894 - DataSourceEc2.py[DEBUG]: url http://169.254.169.254/2016-09-02/meta-data/instance-id raised exception 404 Client Error: Not Found for url: http://169.254.169.254/2016-09-02/meta-data/instance-id
2024-09-17 10:28:26,568 - handlers.py[DEBUG]: finish: init-network/search-Ec2: SUCCESS: found network data from DataSourceEc2
2024-09-17 10:28:26,568 - stages.py[INFO]: Loaded datasource DataSourceEc2 - DataSourceEc2
2024-09-17 10:28:26,616 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.
2024-09-17 10:28:32,536 - util.py[DEBUG]: Cloud-init v. 24.2-0ubuntu1~24.04.2 finished at Tue, 17 Sep 2024 10:28:32 +0000. Datasource DataSourceEc2.  Up 18.53 seconds

a-dubs commented 1 month ago

Hello! Thank you for raising this bug. I was hoping you could provide a little more context on this issue and what you would like to see as a fix.

Looking at the snippet of logs you provided, it seems like 3 different EC2 metadata formats are tried until the 4th attempt succeeds. So I assume a much older metadata scheme was finally returned by the IMDS on that 4th attempt, and then cloud-init issues a warning later due to the lack of a network key in the metadata. Is my understanding correct?

Did cloud-init fail to configure anything or have any regression in functionality after this issue was raised by cloud-init? Or is the issue you are looking for us to solve, solely that a warning is raised (when one shouldn't be), thus causing an undesired exit code 2?

Thanks in advance!

NeilW commented 1 month ago

The 4th attempt is the default metadata version of 2009-04-04, which the code attempts to obtain a 'network' key from. That key doesn't exist, as the list of top-level keys above shows.
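
The fallback order visible in the log can be sketched as follows. This is an illustration of the behaviour described in this thread, not cloud-init's actual code: `pick_api_version` and the `is_available` probe are hypothetical names, and the version strings are taken from the log and from this issue.

```python
from typing import Callable, Sequence

# Newer versions are probed first, in the order seen in the log.
EXTENDED_VERSIONS = ("2021-03-23", "2018-09-24", "2016-09-02")
# Version used when none of the newer ones is served.
MIN_VERSION = "2009-04-04"


def pick_api_version(
    is_available: Callable[[str], bool],
    extended: Sequence[str] = EXTENDED_VERSIONS,
    fallback: str = MIN_VERSION,
) -> str:
    """Return the first extended version the service answers for, else the fallback."""
    for version in extended:
        if is_available(version):
            return version
    return fallback


# Simulate a metadata service that only serves the 2009-04-04 tree,
# so every probe for a newer version fails (the 404s in the log).
served = {"2009-04-04"}
print(pick_api_version(lambda v: v in served))
```

In this sketch the fallback is returned without being probed, so any code that later reads a key introduced after 2009-04-04 has to tolerate its absence.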

Cloud-init appears to do everything it is supposed to do, but some change in cloud-init is now raising exit code 2 in jammy for warnings, when it didn't in bionic. That is failing userdata scripts that rely upon waiting for cloud-init to complete successfully before continuing.

Really though, it shouldn't be throwing a warning on that network key with the 2009-04-04 version of the metadata.
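
Until the warning itself is fixed, a script that waits on cloud-init could treat exit code 2 as "finished, with warnings" instead of as a failure. The sketch below assumes that 0 means success and that any code other than 0 or 2 is a real failure; the helper names are hypothetical.

```python
import subprocess


def cloud_init_finished(returncode: int, tolerate_warnings: bool = True) -> bool:
    """Decide whether a `cloud-init status` exit code should count as success.

    Assumption: 0 is a clean run, 2 is a run that completed but logged
    warnings (as described in this issue), anything else is a failure.
    """
    if returncode == 0:
        return True
    if returncode == 2 and tolerate_warnings:
        return True
    return False


def wait_for_cloud_init() -> bool:
    """Block until cloud-init completes, then interpret its exit code."""
    result = subprocess.run(["cloud-init", "status", "--wait"], check=False)
    return cloud_init_finished(result.returncode)
```

Calling `wait_for_cloud_init()` only makes sense on a machine where cloud-init is installed; `cloud_init_finished` can be exercised anywhere.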

a-dubs commented 1 month ago

@NeilW Thank you for the context!

holmanb commented 1 month ago

> That is failing userdata scripts that rely upon waiting for cloud-init to complete successfully before continuing.

Hi @NeilW, thanks for reporting this issue. I'm happy to help get this fixed, however I don't have access to brightbox. Are you willing and able to put together a fix for this? The current code appears to work correctly in EC2, otherwise we would see this failure in cloud-init's integration tests which check for warnings like this.

NeilW commented 1 month ago

You’ll forgive me. I couldn’t find a 2009-04-04 version of the EC2 metadata in your test suite.

Could you point me to it?

holmanb commented 1 month ago

> Could you point me to it?

I don't think that we explicitly test a specific version of the EC2 metadata, but our integration test suite works more generally by launching an existing instance on a cloud, then cleaning the image (removing artifacts) and installing the latest version of cloud-init before booting it "clean". I would have to dig to understand which version is used in our tests, but I can tell you that we test EC2 daily, and a warning like this would have triggered a failing test in our verify_clean_boot() or verify_clean_log() utility functions which run on many of the EC2 tests.
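
The kind of check described above can be approximated with a small log scan. This is only a sketch of the idea and not the implementation of verify_clean_boot() or verify_clean_log(); `find_warnings` and the regular expression are assumptions based on the log format shown earlier in this issue.

```python
import re

# Matches lines such as:
# 2024-09-17 10:28:26,616 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.
LOG_LINE = re.compile(
    r"^\S+ \S+ - (?P<module>\S+)\[(?P<level>[A-Z]+)\]: (?P<message>.*)$"
)


def find_warnings(log_text: str) -> list[str]:
    """Return the messages of all log lines at WARNING level or above."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("level") in ("WARNING", "ERROR", "CRITICAL"):
            hits.append(match.group("message"))
    return hits


sample = """\
2024-09-17 10:28:26,568 - stages.py[INFO]: Loaded datasource DataSourceEc2 - DataSourceEc2
2024-09-17 10:28:26,616 - DataSourceEc2.py[WARNING]: Metadata 'network' key not valid: None.
"""
print(find_warnings(sample))
```

A test built on a scan like this fails as soon as the 'network' key warning appears, which is why a platform serving only 2009-04-04 metadata would be caught if it were part of the test matrix.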

@NeilW are you a brightbox developer?

NeilW commented 1 month ago

That's what I understood from the code. The integration test only exercises IMDSv2 on EC2 using the 2021-03-23 version of the metadata layout. It doesn't check the other versions of the metadata, nor IMDSv1, and runs at a different time in the boot sequence (using DataSourceEc2Local rather than DataSourceEc2).

The unit tests only cover:

https://github.com/canonical/cloud-init/blob/c62d7f22cf769cf7b293eea37813005575e24a7e/tests/unittests/sources/test_ec2.py#L46

https://github.com/canonical/cloud-init/blob/c62d7f22cf769cf7b293eea37813005575e24a7e/tests/unittests/sources/test_ec2.py#L131

https://github.com/canonical/cloud-init/blob/c62d7f22cf769cf7b293eea37813005575e24a7e/tests/unittests/sources/test_ec2.py#L280

with the 'default metadata' in the tests referring to the 2016-09-02 version.

The question then is whether the code needs to match the tests and the min_metadata_version should really be 2016-09-02?

Brightbox is intending to update the metadata version it is issuing to the 2021-03-23 version, largely to avoid the time it would take to back port any fix to Ubuntu Noble.

I do some work for Brightbox when they ask me to, and I worked with Scott on the Brightbox bits of cloud-init back in the day.

holmanb commented 1 month ago

> The question then is whether the code needs to match the tests and the min_metadata_version should really be 2016-09-02?

If that is the oldest version that supports the network key, then probably yes.

> Brightbox is intending to update the metadata version it is issuing to the 2021-03-23 version, largely to avoid the time it would take to back port any fix to Ubuntu Noble.

I'm guessing that this is not a new failure and was noticed due to the status 2 changes that landed in Noble?

NeilW commented 1 month ago

2016-09-02 is the oldest version that cloud-init looks and tests for. The network key itself has been available since 2011-01-01 (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html).
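
One way to express that cutoff in code is to gate the network-key lookup on the metadata version. This is a sketch of one possible approach, not the fix adopted upstream; `expects_network_key` is a hypothetical helper and the 2011-01-01 date is the one stated above.

```python
# First metadata version that provides the 'network' key, per the comment above.
NETWORK_KEY_SINCE = "2011-01-01"


def expects_network_key(api_version: str) -> bool:
    """Return True if this metadata version should contain a 'network' key.

    Versions are YYYY-MM-DD strings, so plain string comparison orders
    them chronologically.
    """
    return api_version >= NETWORK_KEY_SINCE


for version in ("2009-04-04", "2016-09-02", "2021-03-23"):
    print(version, expects_network_key(version))
```

With a guard like this, a missing key would only be worth a warning when the version in use is supposed to provide it.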

This failure showed up in Noble with the status 2 changes. We initially thought it was the netplan failures (hence #5374). It was only after that was fixed that we realised cloud-init had changed the status for all warnings.