canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

machines fail to boot if MAAS doesn't respond to cloud-init #3824

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1910552

Launchpad details
affected_projects = ['maas', 'maas-doc']
assignee = None
assignee_name = None
date_closed = 2022-08-19T16:37:05.525027+00:00
date_created = 2021-01-07T14:39:52.160666+00:00
date_fix_committed = 2022-06-27T21:30:13.939471+00:00
date_fix_released = 2022-08-19T16:37:05.525027+00:00
id = 1910552
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1910552
milestone = None
owner = adam-collard
owner_name = Adam Collard
private = False
status = fix_released
submitter = sajoupa
submitter_name = Laurent Sesquès
tags = []
duplicates = [1925345, 1951995]

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-01-07T14:39:52.160666+00:00

We have a recurring issue on a MAAS 2.3.7 (xenial), where once in a while we need to restart rackd and regiond to make MAAS respond to machines rebooting. That would be a different bug, though. What I'd like to report here is that a machine should be able to finish its boot sequence even if it can't talk to the MAAS API.

Observed behaviour:

[ OK ] Started Raise network interfaces.
[ OK ] Reached target Network.
Starting Initial cloud-init job (metadata service crawler)...
(stuck here indefinitely)

(restart rackd and regiond)

the machine reboots successfully.

ubuntu-server-builder commented 1 year ago

Launchpad user Alberto Donato(ack) wrote on 2021-05-18T10:16:04.877905+00:00

Does this happen during deployment or regular machine boot after deployment?

ubuntu-server-builder commented 1 year ago

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-05-19T07:14:41.832994+00:00

This is during regular reboots after deployment.

ubuntu-server-builder commented 1 year ago

Launchpad user Björn Tillenius(bjornt) wrote on 2021-05-25T08:40:52.773535+00:00

I agree that we should set things up so that they rely on MAAS as little as possible after they are deployed.

We'll have to look into whether we can configure cloud-init not to talk to MAAS at all, or at least to fail gracefully.

ubuntu-server-builder commented 1 year ago

Launchpad user Peter Sabaini(peter-sabaini) wrote on 2021-08-17T10:36:07.808270+00:00

Subscribing field-high as we're seeing this repeatedly in prod, preventing nodes from coming up.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2021-09-07T14:28:08.959997+00:00

Can we get the cloud-init logs post reboot? "cloud-init collect-logs" (with a -u if there's no sensitive userdata involved).

ubuntu-server-builder commented 1 year ago

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-13T08:26:46.197986+00:00

I don't have the logs from the issue which led to the original report here. However, it should be straightforward to reproduce. FTR, I tested locally on a setup using MAAS 2.8.6, and couldn't reproduce. The logs show failures to post cloud-init events, but they're not blocking. I attached the logs.

Launchpad attachments: cloud-init.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-13T09:29:54.899260+00:00

I couldn't reproduce because I stopped the regionds, so cloud-init got an immediate connection refused and moved on. To reproduce, we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2021-09-14T13:34:41.383873+00:00

"we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce."

Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?

ubuntu-server-builder commented 1 year ago

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-15T11:38:52.864674+00:00

"Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?"

Yes. We also observed that if the packets are DROPed between the node and MAAS, the boot process gets stuck as well. REJECTing the packets causes the connection to be refused, so the boot sequence moves on immediately. (This happened to us on a node whose MAAS had been decommissioned and the corresponding firewall rules removed. Adding a rule to REJECT the packets unblocked the boot sequence.)

I could reproduce in a local test environment by DROPing packets along the way. I attached the logs. (The logs should show that after a few minutes the node succeeds in contacting MAAS. That's because I removed the rule at that point.)

Launchpad attachments: cloud-init.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2021-09-15T14:18:35.704033+00:00

Hey Laurent. I looked over the logs, and I'm not sure where the issue occurred. There are a number of large time gaps, but they seem to be between invocations of cloud-init. Given how long the log is, can you help me pinpoint where the issue occurred?

ubuntu-server-builder commented 1 year ago

Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-16T08:53:52.191624+00:00

The issue is visible when the machine boots on 2021-09-15 09:49. cloud-init fails to communicate with MAAS until 10:06:20, which is when I removed the DROP rule.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2021-09-16T13:33:56.364388+00:00

Thanks. It looks like the issue is that a reporting URL has been set up, so logs are posted back to MAAS. This happens synchronously, so even though each request has a 20 second timeout, when there are dozens/hundreds of logs to send, the timeouts add up and make it look like cloud-init has stalled.
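To make the arithmetic behind that explanation concrete, here is a minimal sketch of the worst case, assuming every queued post runs one after another and each one waits out the full 20 second timeout. The function name and the event counts are hypothetical; only the 20 second figure comes from the comment above.

```python
def worst_case_stall_seconds(queued_events: int, timeout_seconds: float = 20.0) -> float:
    """Upper bound on the stall when every synchronous post times out.

    Assumes posts are strictly sequential and none is retried.
    """
    return queued_events * timeout_seconds


# Hypothetical queue sizes covering the "dozens/hundreds" range.
for count in (12, 100, 300):
    stall = worst_case_stall_seconds(count)
    print(f"{count} events: up to {stall:.0f} s ({stall / 60:.1f} min)")
```

Under these assumptions a queue of 100 events can hold up boot for more than half an hour, which from the console would be hard to tell apart from a hang.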

ubuntu-server-builder commented 1 year ago

Launchpad user Jerzy Husakowski(jhusakowski) wrote on 2022-03-24T09:52:37.113051+00:00

We want to change the configuration of cloud-init to prevent attempts to contact MAAS after the first boot.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2022-06-27T21:30:20.508721+00:00

See https://github.com/canonical/cloud-init/pull/1491

ubuntu-server-builder commented 1 year ago

Launchpad user Brett Holman(holmanb) wrote on 2022-08-19T16:37:06.595213+00:00

This bug is believed to be fixed in cloud-init in version 22.3. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Collard(adam-collard) wrote on 2023-04-21T10:38:11.804278+00:00

Deployed a machine using an alpha build of MAAS 3.4.0 with cloud-init 23.1.1-0ubuntu0~20.04.1

From initial deployment

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:21:24.408999
    Kernel ended boot at: 2023-04-21 10:21:28.556996
    Kernel time to boot (seconds): 4.14799690246582
    Cloud-init activated by systemd at: 2023-04-21 10:21:33.759920
    Time between Kernel end boot and Cloud-init activation (seconds): 5.202924013137817
    Cloud-init start: 2023-04-21 10:21:36.521000
successful

Then rebooted with MAAS still available

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:29:39.458092
    Kernel ended boot at: 2023-04-21 10:29:43.598001
    Kernel time to boot (seconds): 4.139909029006958
    Cloud-init activated by systemd at: 2023-04-21 10:29:48.546367
    Time between Kernel end boot and Cloud-init activation (seconds): 4.948365926742554
    Cloud-init start: 2023-04-21 10:29:51.154000
successful

Then turned off MAAS, and rebooted a second time

ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:34:03.435137
    Kernel ended boot at: 2023-04-21 10:34:07.571495
    Kernel time to boot (seconds): 4.136358022689819
    Cloud-init activated by systemd at: 2023-04-21 10:34:12.638482
    Time between Kernel end boot and Cloud-init activation (seconds): 5.066987037658691
    Cloud-init start: 2023-04-21 10:34:15.266000
successful

and confirmed in the cloud-init logs that we saw the expected

2023-04-21 10:34:27,759 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
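The warning above suggests a "give up after repeated failures" policy. The following is only a rough sketch of that idea, not cloud-init's WebHookHandler: the class name, the threshold of 3 and the `post` callback are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


class CancellingReporter:
    """Posts queued events in order and gives up after repeated failures.

    Illustrative only: the name, threshold and `post` callback are
    assumptions, not cloud-init's actual handler.
    """

    def __init__(self, post: Callable[[str], None], max_consecutive_failures: int = 3):
        self._post = post
        self._max_failures = max_consecutive_failures
        self._queue: Deque[str] = deque()
        self.delivered: List[str] = []
        self.cancelled = 0

    def enqueue(self, event: str) -> None:
        self._queue.append(event)

    def flush(self) -> None:
        failures = 0
        while self._queue:
            event = self._queue.popleft()
            try:
                self._post(event)
            except OSError:  # TimeoutError and ConnectionError are subclasses
                failures += 1
                if failures >= self._max_failures:
                    # Drop everything still waiting so boot is not held up.
                    self.cancelled = len(self._queue)
                    self._queue.clear()
                    return
            else:
                failures = 0  # a success resets the streak
                self.delivered.append(event)


def unreachable_endpoint(event: str) -> None:
    """Stand-in for a POST to an endpoint that never answers."""
    raise TimeoutError("simulated timeout")


reporter = CancellingReporter(unreachable_endpoint, max_consecutive_failures=3)
for i in range(100):
    reporter.enqueue(f"event-{i}")
reporter.flush()
print(f"delivered {len(reporter.delivered)}, cancelled {reporter.cancelled}")
```

With a policy like this, the cost of an unreachable endpoint is bounded by the threshold times the per-request timeout rather than by the length of the queue.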

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Collard(adam-collard) wrote on 2023-04-21T11:19:41.880839+00:00

For the docs please note this as a fixed bug

ubuntu-server-builder commented 1 year ago

Launchpad user Junien Fridrick(axino) wrote on 2023-04-24T06:42:55.002377+00:00

@adam-collard Hello! Thanks for progressing this issue.

I believe stopping MAAS isn't a good way to verify that this issue is fixed; see comments 7 to 9 as to why. Basically, we experience this bug in two different scenarios:

a) packets to the MAAS servers are iptables-DROPed. Adding a REJECT rule works around the problem.

b) MAAS accepts the TCP connection, but does nothing with it (because it's spinning on CPU for some reason). Restarting (or stopping) MAAS works around the problem.
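Scenario b) can be approximated locally without MAAS. The sketch below is an assumption-laden stand-in: a loopback TCP server that accepts connections and never replies, probed by a client with a short timeout, next to a closed port that refuses immediately (the behaviour seen when MAAS is simply stopped or packets are REJECTed). The helper names and the 0.5 second timeout are hypothetical. Scenario a) needs a real firewall rule and is not reproduced here; from the client side it would show up as a timeout during connect rather than during the read.

```python
import socket
import threading


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Classify a TCP peer as 'refused', 'timeout', 'closed' or 'answered'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
            data = conn.recv(1024)  # blocks until data, EOF or timeout
            return "answered" if data else "closed"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"


def start_hung_server() -> int:
    """Listen on a free loopback port, accept connections, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever() -> None:
        while True:
            conn, _ = server.accept()
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server.getsockname()[1]


def unused_port() -> int:
    """Return a loopback port that was just released, so nothing listens on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


hung_result = probe("127.0.0.1", start_hung_server(), timeout=0.5)
refused_result = probe("127.0.0.1", unused_port(), timeout=0.5)
print("accepts but never replies:", hung_result)
print("nothing listening:", refused_result)
```

A verification run against a server like this, rather than against a stopped MAAS, would exercise the path where each request has to wait out its timeout.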