canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

DataSourceOpenstack: Retry when waiting for the metadata service too #3989

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1979049

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2022-06-20T18:46:41.398407+00:00
date_created = 2022-06-17T09:59:38.781185+00:00
date_fix_committed = None
date_fix_released = None
id = 1979049
importance = wishlist
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1979049
milestone = None
owner = dcaro
owner_name = David Caro
private = False
status = opinion
submitter = dcaro
submitter_name = David Caro
tags = []
duplicates = []

Launchpad user David Caro(dcaro) wrote on 2022-06-17T09:59:38.781185+00:00

We are seeing some instability in our OpenStack Wallaby systems right now, and have found that even when the "retries" option is passed to the cloud-init OpenStack datasource, the code can still fail on the first attempt to detect a live metadata service, because it does not retry at that point (only when fetching the data itself).

Some logs of the error:

2022-06-16 03:25:35,449 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0, 'headers': {'User-Agent': 'Cloud-Init/20.4.1'}} configuration
2022-06-16 03:25:45,469 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [10/-1s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=10.0)]
2022-06-16 03:25:45,471 - DataSourceOpenStack.py[DEBUG]: Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 10 seconds
2022-06-16 03:25:45,471 - util.py[WARNING]: No active metadata service found
2022-06-16 03:25:45,477 - util.py[DEBUG]: No active metadata service found
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 145, in _get_data
    results = self._crawl_metadata()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOpenStack.py", line 181, in _crawl_metadata
    raise sources.InvalidMetaDataException(
cloudinit.sources.InvalidMetaDataException: No active metadata service found

We think it should also retry on this first pass. We are still looking into the sources of the instability, but this would help make cloud-init more resilient.
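To illustrate the behaviour being requested, here is a minimal sketch of a detection step that retries the probe a fixed number of times before giving up. It is not cloud-init's code: `probe_url`, `find_active_metadata_url`, and the `retries` / `sec_between` parameters are hypothetical names chosen for this example, and only the Python standard library is used.

```python
import http.client
import time
import urllib.request


def probe_url(url, timeout=10.0):
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def find_active_metadata_url(urls, retries=5, sec_between=1.0,
                             probe=probe_url, sleep=time.sleep):
    """Probe each URL in turn; repeat the whole pass up to `retries` extra times.

    Returns the first URL that responds, or None if every pass fails.
    """
    for attempt in range(retries + 1):
        for url in urls:
            if probe(url):
                return url
        if attempt < retries:
            sleep(sec_between)  # give the metadata service time to come up
    return None


if __name__ == "__main__":
    # The address below is the one that appears in the logs above.
    print(find_active_metadata_url(["http://169.254.169.254/openstack"], retries=2))
```

With a single attempt (`retries=0`) this behaves like the failing case in the logs: one timed-out request is enough to report that no metadata service is active.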

pshchelo commented 1 year ago

We ran into something similar, and I think what is happening is OVN :-)

In OpenStack Neutron with the ML2/OVS backend there is a whole mechanism of 'provisioning blocks', so by the time the VM is started you have a guarantee that DHCP for the instance is ready, and this is why it worked for so many years without retries here. With OVN instead of OVS, as far as I understand, there is no such guarantee any more, so technically a VM can be started before DHCP is ready. Thus we should add max_wait for the OpenStack data source as well, e.g. similar to https://github.com/canonical/cloud-init/blob/be7f64d7615cc3f2f14b6e998a577536464a2524/cloudinit/sources/DataSourceCloudStack.py#L81
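As a rough sketch of what a `max_wait`-style wait could look like, the loop below keeps probing until a deadline passes instead of counting attempts. This is only an illustration of the idea, not the CloudStack or OpenStack datasource code: `wait_for_metadata`, `sleep_time`, and the injectable `probe`, `clock`, and `sleep` arguments are assumptions made for this example.

```python
import time


def wait_for_metadata(urls, probe, max_wait=120.0, sleep_time=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Probe `urls` repeatedly until one responds or `max_wait` seconds elapse.

    `probe` is any callable taking a URL and returning True when the
    service answered. Returns the responding URL, or None on timeout.
    """
    deadline = clock() + max_wait
    while True:
        for url in urls:
            if probe(url):
                return url
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # waited the full max_wait without an answer
        # Never sleep past the deadline.
        sleep(min(sleep_time, remaining))
```

A deadline-based wait fits the OVN case described above, where the unknown is how long DHCP and the metadata path take to become ready rather than how many requests fail.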