Closed ubuntu-server-builder closed 1 year ago
Launchpad user Alberto Donato(ack) wrote on 2021-05-18T10:16:04.877905+00:00
Does this happen during deployment or regular machine boot after deployment?
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-05-19T07:14:41.832994+00:00
This is during regular reboots after deployment.
Launchpad user Björn Tillenius(bjornt) wrote on 2021-05-25T08:40:52.773535+00:00
I agree that we should set things up so that they rely on MAAS as little as possible after they are deployed.
We'll have to look into whether we can configure cloud-init not to talk to MAAS at all, or at least to fail gracefully.
Launchpad user Peter Sabaini(peter-sabaini) wrote on 2021-08-17T10:36:07.808270+00:00
Subscribing field-high as we're seeing this repeatedly in prod, preventing nodes from coming up
Launchpad user James Falcon(falcojr) wrote on 2021-09-07T14:28:08.959997+00:00
Can we get the cloud-init logs post reboot? "cloud-init collect-logs" (with a -u if there's no sensitive userdata involved).
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-13T08:26:46.197986+00:00
I don't have the logs from the issue which led to the original report here. However it should be straightforward to reproduce. FTR, I tested locally on a setup using MAAS 2.8.6, and couldn't reproduce. The logs show failures to post cloud-init events, but they're not blocking. I attached the logs. Launchpad attachments: cloud-init.tar.gz
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-13T09:29:54.899260+00:00
I couldn't reproduce because I stopped the regionds, so cloud-init got an immediate connection refused and moved on. To reproduce, we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce.
Launchpad user James Falcon(falcojr) wrote on 2021-09-14T13:34:41.383873+00:00
"we'd need MAAS to accept the connection and just hang indefinitely, which I can't see right now how to reproduce."
Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-15T11:38:52.864674+00:00
"Are you saying that cloud-init is attempting to read metadata from MAAS, but then MAAS accepts the request but never sends a response?"
Yes. We also observed that if the packets are DROPed between the node and MAAS, the boot process gets stuck as well. REJECTing the packets causes the connection to be refused, so the boot sequence immediately moves on. (This happened to us on a node whose MAAS had been decommissioned and the corresponding firewall rules removed. Adding a rule to REJECT the packets unblocked the boot sequence.)
I could reproduce in a local test environment, DROPing packets along the way. I attached the logs. (The logs should show that after a few minutes, the node succeeds in contacting MAAS. That's because I removed the rule at that point.) Launchpad attachments: cloud-init.tar.gz
Launchpad user James Falcon(falcojr) wrote on 2021-09-15T14:18:35.704033+00:00
Hey Laurent. I looked over the logs, and I'm not sure where the issue occurred. There are a number of large time gaps, but they seem to be between invocations of cloud-init. Given how long the log is, can you help me pinpoint where the issue occurred?
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-09-16T08:53:52.191624+00:00
The issue is visible when the machine boots on 2021-09-15 09:49. cloud-init fails to communicate with MAAS until 10:06:20, which is when I removed the DROP rule.
Launchpad user James Falcon(falcojr) wrote on 2021-09-16T13:33:56.364388+00:00
Thanks. It looks like the issue is that a reporting URL has been set up, so logs are configured to be posted back to MAAS. This happens synchronously, so even though each request has a 20 second timeout, when there are dozens or hundreds of logs to send, the timeouts add up and make it look like cloud-init has stalled.
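To make the arithmetic concrete, here is a rough sketch. It assumes every queued event is posted one at a time and that each post waits the full 20 second timeout before failing. The event counts are illustrative only and are not taken from the attached logs; the function name is made up for the example.

```python
# Hypothetical worst-case estimate: events are posted one at a time, and each
# post blocks for the full timeout because MAAS never answers.
TIMEOUT_SECONDS = 20  # per-request timeout mentioned in the comment above


def worst_case_stall_seconds(queued_events: int, timeout: int = TIMEOUT_SECONDS) -> int:
    """Total time spent waiting if every synchronous post times out."""
    return queued_events * timeout


for count in (12, 50, 100):  # illustrative event counts, not measured values
    minutes = worst_case_stall_seconds(count) / 60
    print(f"{count} events -> {minutes:.1f} minutes blocked")
```

Under these assumptions, a few dozen queued events are enough to block boot for many minutes, even though no single request waits longer than 20 seconds.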
Launchpad user Jerzy Husakowski(jhusakowski) wrote on 2022-03-24T09:52:37.113051+00:00
We want to change the configuration of cloud-init to prevent attempts to contact MAAS after the first boot.
Launchpad user James Falcon(falcojr) wrote on 2022-06-27T21:30:20.508721+00:00
Launchpad user Brett Holman(holmanb) wrote on 2022-08-19T16:37:06.595213+00:00
This bug is believed to be fixed in cloud-init version 22.3. If this is still a problem for you, please make a comment and set the state back to New.
Thank you.
Launchpad user Adam Collard(adam-collard) wrote on 2023-04-21T10:38:11.804278+00:00
Deployed a machine using an alpha build of MAAS 3.4.0 with cloud-init 23.1.1-0ubuntu0~20.04.1
From initial deployment
ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:21:24.408999
    Kernel ended boot at: 2023-04-21 10:21:28.556996
    Kernel time to boot (seconds): 4.14799690246582
    Cloud-init activated by systemd at: 2023-04-21 10:21:33.759920
    Time between Kernel end boot and Cloud-init activation (seconds): 5.202924013137817
    Cloud-init start: 2023-04-21 10:21:36.521000
successful
Then rebooted with MAAS still available
ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:29:39.458092
    Kernel ended boot at: 2023-04-21 10:29:43.598001
    Kernel time to boot (seconds): 4.139909029006958
    Cloud-init activated by systemd at: 2023-04-21 10:29:48.546367
    Time between Kernel end boot and Cloud-init activation (seconds): 4.948365926742554
    Cloud-init start: 2023-04-21 10:29:51.154000
successful
Then turned off MAAS, and rebooted a second time
ubuntu@petrel:~$ cloud-init analyze boot
-- Most Recent Boot Record --
    Kernel Started at: 2023-04-21 10:34:03.435137
    Kernel ended boot at: 2023-04-21 10:34:07.571495
    Kernel time to boot (seconds): 4.136358022689819
    Cloud-init activated by systemd at: 2023-04-21 10:34:12.638482
    Time between Kernel end boot and Cloud-init activation (seconds): 5.066987037658691
    Cloud-init start: 2023-04-21 10:34:15.266000
successful
and confirmed in the cloud-init logs that we saw the expected warning:
2023-04-21 10:34:27,759 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
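The warning suggests that the handler gives up after several failed posts in a row instead of working through the whole queue. The following is a minimal sketch of that general idea only. It is not cloud-init's actual code: the class name FailFastPublisher, the threshold of 3, and the post callable are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class FailFastPublisher:
    """Hypothetical publisher that stops posting after repeated consecutive failures."""

    def __init__(self, post: Callable[[str], bool], max_consecutive_failures: int = 3):
        self.post = post  # returns True on success, False on failure or timeout
        self.max_consecutive_failures = max_consecutive_failures  # assumed threshold
        self.queue: Deque[str] = deque()

    def publish_all(self) -> List[str]:
        """Post queued events in order and return the ones that were delivered."""
        delivered: List[str] = []
        failures = 0
        while self.queue:
            event = self.queue.popleft()
            if self.post(event):
                delivered.append(event)
                failures = 0  # any success resets the streak
            else:
                failures += 1
                if failures >= self.max_consecutive_failures:
                    self.queue.clear()  # cancel everything still queued
                    break
        return delivered


# Usage: an endpoint that never answers, with 100 events waiting.
attempts: List[str] = []


def always_fail(event: str) -> bool:
    attempts.append(event)
    return False


publisher = FailFastPublisher(always_fail)
publisher.queue.extend(f"event-{i}" for i in range(100))
publisher.publish_all()
print(len(attempts))  # 3
```

With a cutoff like this, an unreachable endpoint costs a handful of timeouts rather than one per queued event.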
Launchpad user Adam Collard(adam-collard) wrote on 2023-04-21T11:19:41.880839+00:00
For the docs please note this as a fixed bug
Launchpad user Junien Fridrick(axino) wrote on 2023-04-24T06:42:55.002377+00:00
@adam-collard hello! Thanks for progressing this issue.
I believe stopping MAAS isn't a good way to verify that this issue is fixed, see comments 7 to 9 as to why. Basically, we experience this bug in two different scenarios:
a) packets to the MAAS servers are iptables-DROPed. Adding a REJECT rule works around the problem.
b) MAAS accepts the TCP connection, but does nothing with it (because it's spinning on CPU for some reason). Restarting (or stopping) MAAS works around the problem.
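The difference between a hanging endpoint and a refused connection can be sketched locally. This is a minimal Python illustration, not a MAAS or cloud-init reproduction: a listening socket that never calls accept() stands in for scenario (b), and a port with no listener stands in for a refused connection, as when the regionds are stopped. The helper name time_connection and the one second timeout are assumptions made for the example. Scenario (a) needs a firewall rule and is not covered here.

```python
import socket
import time
from typing import Tuple


def time_connection(host: str, port: int, timeout: float) -> Tuple[str, float]:
    """Try one request; report how it ended and how long it took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
            conn.recv(1024)  # blocks until data, EOF, or the timeout
        outcome = "responded"
    except socket.timeout:
        outcome = "timed out"
    except ConnectionRefusedError:
        outcome = "refused"
    return outcome, time.monotonic() - start


# Stand-in for scenario (b): the kernel completes the TCP handshake,
# but the application never accepts the connection or sends a reply.
hanging = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hanging.bind(("127.0.0.1", 0))
hanging.listen(5)
hanging_port = hanging.getsockname()[1]

# Stand-in for a refused connection: reserve a free port, then close it
# so that nothing is listening there.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

for label, port in (("hanging", hanging_port), ("closed", closed_port)):
    outcome, elapsed = time_connection("127.0.0.1", port, timeout=1.0)
    print(f"{label}: {outcome} after {elapsed:.2f}s")
```

The hanging endpoint costs the caller the full timeout on every request, while the refused connection fails almost immediately. That is why a test that simply stops MAAS does not exercise either scenario.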
This bug was originally filed in Launchpad as LP: #1910552
Launchpad user Laurent Sesquès(sajoupa) wrote on 2021-01-07T14:39:52.160666+00:00
We have a recurring issue on a MAAS 2.3.7 (xenial), where once in a while we need to restart rackd and regiond to make MAAS respond to machines rebooting. This in itself would be a different bug, though. What I'd like to report here is that a machine should be able to finish its boot sequence even if it can't talk to the MAAS API.
Observed behaviour:
[ OK ] Started Raise network interfaces.
[ OK ] Reached target Network.
Starting Initial cloud-init job (metadata service crawler)...
(stuck here indefinitely)
(restart rackd and regiond)
the machine reboots successfully.