canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

commissioning fails silently if a node can't reach the region controller #2439

Open ubuntu-server-builder opened 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1303925

Launchpad details
affected_projects = ['maas']
assignee = None
assignee_name = None
date_closed = None
date_created = 2014-04-07T17:39:11.416336+00:00
date_fix_committed = None
date_fix_released = None
id = 1303925
importance = low
is_complete = False
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1303925
milestone = None
owner = julian-edwards
owner_name = Julian Edwards
private = False
status = triaged
submitter = elmo
submitter_name = James Troup
tags = ['canonical-is', 'provisioning', 'robustness']
duplicates = []

Launchpad user James Troup(elmo) wrote on 2014-04-07T17:39:11.416336+00:00

We recently had a node which completely refused to commission in MAAS. After (literally) several man days of debugging, we figured out that it was because the node couldn't talk to the region controller over HTTP.

Obviously, that's ultimately our mistake/problem, but MAAS could have been a lot better at helping us to help ourselves; currently, there's absolutely no indication from the boot process that the HTTP connection to the region controller is the problem.

Attached is the serial console output (from the point of boot) for the node that was failing to commission. 91.189.94.35 is the MAAS region controller and 91.189.88.20 is the MAAS cluster controller.

ubuntu-server-builder commented 1 year ago

Launchpad user James Troup(elmo) wrote on 2014-04-07T17:39:11.416336+00:00

Launchpad attachments: Console output from node that wouldn't commission

ubuntu-server-builder commented 1 year ago

Launchpad user Graham Binns(gmb) wrote on 2014-04-07T18:01:45.911909+00:00

Calling this critical since it's a costly failure state to get into, and targeting it for 14.10.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-08T05:49:30.186676+00:00

James, was it hanging or shutting down after that error in the log?

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-08T05:50:49.455195+00:00

MAAS is not in direct control at this point, I think cloud-init needs to do better here and have a last-ditch catch of exceptions before running a piece of failsafe code that would report something back to MAAS.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-04-08T13:25:00.099366+00:00

cloud-init is executing code that maas told it to execute. so maas needs to tell it to execute code that has some "last ditch catch".

To be clear, cloud-init got data from maas (via kernel cmdline) that told it to get some code from the metadata server to execute. It then executed it. That code failed. That is the code that needs to be more resilient. cloud-init is, by design, very much doing exactly what maas tells it to do.

ubuntu-server-builder commented 1 year ago

Launchpad user Gavin Panella(allenap) wrote on 2014-04-08T14:08:50.698879+00:00

I assume cloud-init doesn't crash if the code it downloads from MAAS breaks... so the reason it's hanging is because the instructions about what to do next were in that downloaded, crashy, piece? If so, I further assume that we therefore need to get some fail-safe command into the first user-data file that cloud-init processes; is that right?

ubuntu-server-builder commented 1 year ago

Launchpad user Gavin Panella(allenap) wrote on 2014-04-08T14:11:05.289913+00:00

I made a mistake: it couldn't actually download the code. However, the question stands: what does cloud-init do if it can't download from a data source? Does it process the next directive in the user-data that it does have?

ubuntu-server-builder commented 1 year ago

Launchpad user James Troup(elmo) wrote on 2014-04-09T00:16:30+00:00

Julian Edwards <1303925@bugs.launchpad.net> writes:

> James, was it hanging or shutting down after that error in the log?

It hung.

-- James

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-09T01:49:00.330667+00:00

Thanks James.

Scott, cloud-init is hanging without getting any data from MAAS. It seems to me that there should be at least a last-ditch way of reporting the failure back somewhere?

This is possibly a dupe of bug 1237215

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-09T02:52:36+00:00

On 09/04/14 00:11, Gavin Panella wrote:

> I made a mistake: it couldn't actually download the code. However, the question stands: what does cloud-init do if it can't download from a data source? Does it process the next directive in the user-data that it does have?

There is no user data at this point though is there? It's trying to get it from the metadata server, as MAAS just passes the URL to that on the kernel cmd line.

I don't know if there's room on the kernel command line to add much more. If there's something simple MAAS can do here then great, but I'm concerned that cloud-init hangs.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-04-09T13:40:32.799268+00:00

[ 0.000000] Command line: nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010 iscsi_target_ip=91.189.88.20 iscsi_target_port=3260 iscsi_initiator=rubay ip=::::rubay:BOOTIF ro root=/dev/disk/by-path/ip-91.189.88.20:3260-iscsi-iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010-lun-1 overlayroot=tmpfs cloud-config-url=http://91.189.94.35/MAAS/metadata/latest/by-id/node-0d287828-be5e-11e3-a0d3-0019bbccd75c/?op=get_preseed log_host=91.189.94.35 log_port=514 console=tty0 console=ttyS1,38400 nosplash initrd=amd64/generic/precise/commissioning/initrd.gz BOOT_IMAGE=amd64/generic/precise/commissioning/linux BOOTIF=01-2c-44-fd-81-23-e8

I did wrongly diagnose this previously. cloud-init could / should warn more loudly that it couldn't get the url for 'cloud-config-url'.

However, here is what happened as I understand it:

  1. The maas cluster controller sent the above kernel command line to a commissioning node (enlistment or commissioning wouldn't really matter; the ephemeral environment is the key).
  2. The node was unable to reach the maas region controller at 91.189.94.35. In a happy path, cloud-init would have gotten that url and stored it in /etc/cloud/cloud.cfg.d/. The content of that url would have then told cloud-init that it should:
     a. only enable the maas datasource (disabling the ec2 datasource)
     b. attempt to get data from the maas datasource on the region controller.
  3. cloud-init failed to get any configuration via the kernel cmdline, so it went on its way looking for all configured datasources, which included the EC2 datasource. Note that the timeout on the EC2 datasource is quite annoying, but was at least historically required, as the EC2 datasource might just not have been there for some time, so polling and retry were necessary. Anyway, that wouldn't have changed anything; the failure path was inevitable given '2' above.

cloud-init probably should have cried more loudly when the request in '2' failed. It is possible that even if it did do that, such a warning would have been lost due to other bugs like bug 1235231. But it should at least WARN, and I'll make sure it does that.
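
For readers following the failure path, here is a minimal Python sketch of the behaviour being asked for: pull `cloud-config-url` out of the kernel command line, try to fetch it, and log a WARNING if the fetch fails before falling through to the other datasources. This is not cloud-init's actual code. The function names, the logger name, the 10-second timeout and the injectable `opener` parameter are all assumptions made for illustration.

```python
import logging
import urllib.error
import urllib.request

# Hypothetical logger name, chosen for this sketch only.
LOG = logging.getLogger("cmdline_url_sketch")


def find_cmdline_value(cmdline, key="cloud-config-url"):
    """Return the value of the first key=value token on a kernel command line, or None."""
    for token in cmdline.split():
        # Split on the first '=' only, so URLs with query strings stay intact.
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value
    return None


def fetch_cmdline_url(cmdline, timeout=10, opener=urllib.request.urlopen):
    """Fetch the cloud-config-url payload; warn and return None on failure."""
    url = find_cmdline_value(cmdline)
    if url is None:
        return None
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        LOG.warning(
            "could not read cloud-config-url %s (%s); "
            "continuing with all configured datasources",
            url,
            exc,
        )
        return None
```

Run against the command line quoted above, `find_cmdline_value` would return the `http://91.189.94.35/MAAS/metadata/...?op=get_preseed` URL, and a node that cannot reach the region controller would at least leave a warning in its log rather than silently moving on to the EC2 datasource.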

To me the most general problem here is the requirement for a node's boot to contact the region controller, and the lack of documentation of that requirement (or failure of the user to know that; I'm not sure whether or not it is documented).

at http://91.189.94.35/MAAS/metadata/... early in its process, cloud-init probably tried and failed to get that url.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-09T23:52:32.057632+00:00

Thanks for the analysis Scott, I concur.

MAAS in general at the moment is a "fire and forget" model which is pretty naive, and we're going to work on making this stuff more robust in the coming weeks.

It seems that cloud-init could help a little if we could provide some other way, via the kernel params, of a "failure" API point which it could do a POST on (with data about the failure) if there is any problem. Is this something you'd consider implementing in cloud-init?
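
A rough Python sketch of what such a "failure" API point could look like from the node's side. Everything specific here is hypothetical: the `failure-url` kernel parameter name, the JSON payload fields and the function names are invented for illustration and are not existing cloud-init or MAAS options.

```python
import json
import urllib.request

# Hypothetical kernel parameter name; not an existing cloud-init option.
FAILURE_KEY = "failure-url"


def build_failure_request(cmdline, stage, error):
    """Build a POST request describing a boot failure, or None if no endpoint is set."""
    endpoint = None
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == FAILURE_KEY:
            endpoint = value
    if endpoint is None:
        return None
    # Hypothetical payload shape: which stage failed and why.
    body = json.dumps({"stage": stage, "error": str(error)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report_failure(cmdline, stage, error, timeout=10):
    """Best-effort report; returns False instead of raising if it cannot be sent."""
    request = build_failure_request(cmdline, stage, error)
    if request is None:
        return False
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except OSError:
        return False
```

The report is deliberately best-effort: if the endpoint is as unreachable as the region controller was in this bug, the POST fails too, so the endpoint would need to live somewhere the node can reach.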

ubuntu-server-builder commented 1 year ago

Launchpad user Mark Shuttleworth(sabdfl) wrote on 2014-04-10T21:21:32.091735+00:00

Let's rather think about how MAAS itself could make this feel more like a managed experience.

All of the above could be held by the cluster controller and only fed to the region controller on demand (i.e. when debugging). That avoids DOS'ing the region controller when PXE-booting the DC.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-11T03:52:46.427662+00:00

Mark, this is pretty much the plan and I've already asked Gavin to look into the changes required to support a more granular boot reporting mechanism like this.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-04-11T17:10:13.310955+00:00

regarding failure post path, we could look at that. it really seems like overloading the kernel cmdline though.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2014-04-22T00:10:43+00:00

On Friday 11 Apr 2014 17:10:13 you wrote:

> regarding failure post path, we could look at that. it really seems like overloading the kernel cmdline though.

Well ultimately MAAS will time out the node and try elsewhere, but if the node is able to pre-empt this while providing valuable debug info, then it's worth it.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-08-25T17:00:51.431810+00:00

I'm marking this triaged for cloud-init. At least 1 solution is understood.