canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.9k stars 863 forks source link

Cloud-init Hangs on System Time Change When Datasource is Unreachable #5285

Open NoneShen opened 4 months ago

NoneShen commented 4 months ago

Bug report

Cloud-init hangs indefinitely during the initialization process when trying to access an unreachable datasource (like Ec2). This issue is observed when there is a significant system time change combined with a datasource that is configured with specific timeout and max_wait settings. The process attempts to connect multiple times due to the datasource being unreachable, eventually leading to a timeout error each time.

Steps to reproduce the problem

  1. Configure the datasource settings in cloud-init to have short timeout and max_wait values as follows: datasource: Ec2: timeout: 5 max_wait: 10 CloudStack: timeout: 5 max_wait: 10 ssh_pwauth: yes

  2. Set up a virtual machine with cloud-init enabled and this configuration applied.

  3. Simulate a datasource unavailability (e.g., by using firewall rules to block access to the EC2 metadata IP 169.254.169.254).

  4. Trigger a cloud-init start with a significant system time change during the initial boot sequence.

  5. bserve that cloud-init attempts to access the datasource and fails, repeating the connection attempts until the max_wait time is exhausted.

Environment details

cloud-init logs

2024-05-11 17:33:55,725 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name 2024-05-11 17:33:55,726 - DataSourceEc2.py[DEBUG]: strict_mode: warn, cloud_name=no-ec2-metadata cloud_platform=ec2 2024-05-11 17:33:55,726 - init.py[DEBUG]: Datasource DataSourceAliYun not updated for events: boot-new-instance 2024-05-11 17:33:55,726 - handlers.py[DEBUG]: finish: init-network/search-AliYun: SUCCESS: no network data found from DataSourceAliYun 2024-05-11 17:33:55,726 - handlers.py[DEBUG]: start: init-network/search-Ec2: searching for network data from DataSourceEc2 2024-05-11 17:33:55,727 - init.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2'> 2024-05-11 17:33:55,727 - init.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance 2024-05-11 17:33:55,727 - util.py[DEBUG]: Reading from /sys/hypervisor/uuid (quiet=False) 2024-05-11 17:33:55,727 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_uuid 2024-05-11 17:33:55,728 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_serial 2024-05-11 17:33:55,728 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/chassis_asset_tag 2024-05-11 17:33:55,728 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/sys_vendor 2024-05-11 17:33:55,728 - DataSourceEc2.py[DEBUG]: strict_mode: warn, cloud_name=unknown cloud_platform=ec2 2024-05-11 17:33:58,424 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 2.696 seconds 2024-05-11 17:33:58,554 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 0.129 seconds 2024-05-11 17:33:58,555 - DataSourceEc2.py[DEBUG]: Removed the following from metadata urls: ['http://instance-data.:8773'] 2024-05-11 17:33:58,555 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 09:59:42,533 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27256/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4b45490>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 09:59:42,533 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again 2024-05-11 09:59:43,536 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 09:59:48,547 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27250/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4b58c10>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 09:59:48,548 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again 2024-05-11 09:59:49,550 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 09:59:54,562 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27243/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4acd6a0>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 09:59:54,562 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again 2024-05-11 09:59:55,564 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:00,574 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27237/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4acd520>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:00,575 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again 2024-05-11 10:00:01,577 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:06,587 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27231/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ad6130>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:06,588 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again 2024-05-11 10:00:07,589 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:12,601 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27225/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ad6b80>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:12,601 - url_helper.py[DEBUG]: Please wait 2 seconds while we wait to try again 2024-05-11 10:00:14,605 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:19,623 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27218/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ae0610>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:19,624 - url_helper.py[DEBUG]: Please wait 2 seconds while we wait to try again 2024-05-11 10:00:21,627 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:26,639 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27211/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ad6040>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:26,640 - url_helper.py[DEBUG]: Please wait 2 seconds while we wait to try again 2024-05-11 10:00:28,643 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:33,657 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27204/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4acdfa0>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:33,658 - url_helper.py[DEBUG]: Please wait 2 seconds while we wait to try again 2024-05-11 10:00:35,661 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:40,674 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27197/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ae04f0>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:40,675 - url_helper.py[DEBUG]: Please wait 2 seconds while we wait to try again 2024-05-11 10:00:42,678 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:47,691 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27190/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4aea0a0>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:47,691 - url_helper.py[DEBUG]: Please wait 3 seconds while we wait to try again 2024-05-11 10:00:50,695 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:00:55,709 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27182/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ae0fd0>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:00:55,710 - url_helper.py[DEBUG]: Please wait 3 seconds while we wait to try again 2024-05-11 10:00:58,713 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:01:03,725 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27174/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4acd910>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:01:03,726 - url_helper.py[DEBUG]: Please wait 3 seconds while we wait to try again 2024-05-11 10:01:06,730 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/2009-04-04/meta-data/instance-id' with {'url': 'http://169.254.169.254/2009-04-04/meta-data/instance-id', 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0} configuration 2024-05-11 10:01:11,743 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [-27166/10s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f54b4ad6250>, 'Connection to 169.254.169.254 timed out. (connect timeout=5.0)'))] 2024-05-11 10:01:11,743 - url_helper.py[DEBUG]: Please wait 3 seconds while we wait to try again

Additional context:

The system's time change appears to interfere with cloud-init's ability to manage its internal timers and retries, leading to prolonged hang periods without successful completion of the initialization process. As you can see, the log entries end with "request error" containes negative wait time, so that it will retry until the wait time finally goes to 10s which mease it has to wait several hours.

holmanb commented 4 months ago

Thanks for reporting this issue! I think that this was actually fixed in 9b1b2944018dc5a49, which should be available in the upcoming release 24.2.

@NoneShen could you mind testing this?

NoneShen commented 4 months ago

Thanks for reporting this issue! I think that this was actually fixed in 9b1b294, which should be available in the upcoming release 24.2.

@NoneShen could you mind testing this?

Sure, thank you for your reply.

sergey-safarov commented 3 days ago

I have tested on 24.2 and the issue is still present. This is reproduced on the Amazon cloud with an IPv6-only network. Used CentOS 9 and systemd-netword initialization. I have disabled network initialization in the cloud-init config.

Related ticket https://github.com/canonical/cloud-init/issues/5388