canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

[b1] pod created vm fails commissioning after getting 404 from metadata api #3097

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1742971

Launchpad details
affected_projects = ['maas', 'maas/2.3']
assignee = None
assignee_name = None
date_closed = 2018-04-05T22:48:19.018350+00:00
date_created = 2018-01-12T15:54:49.872103+00:00
date_fix_committed = None
date_fix_released = None
id = 1742971
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1742971
milestone = None
owner = jason-hobbs
owner_name = Jason Hobbs
private = False
status = invalid
submitter = jason-hobbs
submitter_name = Jason Hobbs
tags = ['cdo-qa', 'cdo-qa-blocker', 'foundations-engine', 'track']
duplicates = []

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-01-12T15:54:49.872103+00:00

A vm created via the pods API in maas failed to commission immediately after it was created.

It PXE booted, got initrd and the kernel, dhcp'd again, and then wasn't heard from anymore:

http://paste.ubuntu.com/26372429/

There are no rsyslog logs for it. The hostname of the vm is landscapeamqp-1.maas.

If I connect to the node with virt-viewer, it is sitting at an Ubuntu prompt, but I can't login because there are no passwords set. There are no console logs available on disk (bug 1742971).

This is with maas 2.3.0 (6434-gd354690-0ubuntu1~16.04.1) in an HA setup - logs from the MAAS servers are available in the attached infra-logs.tar.

To check for this bug you can look for 404 errors like this: http://paste.ubuntu.com/p/hsbw22BFps/

This was using the default daily maas images.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-01-12T15:54:49.872103+00:00

Launchpad attachments: infra-logs.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-01-12T16:29:07.084364+00:00

@Jason,

As per our IRC chat, while there are no console logs stored on the host, this doesn't prevent you from obtaining the logs by attempting to recommission:

  1. Attempt to get the kernel params being sent, you will see those on the VM's KVM. (Try re-commissioning to obtain these).

  2. Enable console log as kernel params in the specific VM, and attempt to re-commission.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-01-17T14:15:53.679904+00:00

I hit this again today. This is a race condition - it isn't reproducible by re-commissioning the failing VM. I did that, and commissioning succeeded that time. I still grabbed the kernel command line, but I don't think it's very useful. http://paste.ubuntu.com/26404667/

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-19T20:34:26.357410+00:00

We reproduced this some last week, and I noticed that we're getting 404's when the failing node tries to retrieve its preseed:

10.244.40.31/var/log/maas/regiond.log:2018-03-15 11:28:33 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/cwerrd/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)

The problem there is that the /MAAS prefix is missing from the path. Here's a successful request from another VM:

10.244.40.32/var/log/maas/regiond.log:2018-03-15 11:29:04 regiond: [info] 10.244.40.32 GET /MAAS/metadata/latest/by-id/w3hg84/?op=get_preseed HTTP/1.1 --> 200 OK (referrer: -; agent: Cloud-Init/17.2)

Notice the /MAAS in the URL on the successful one.

Now, the question is why the /MAAS is missing in the failure case.
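For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that scans regiond.log lines for 404 responses to metadata requests whose path lacks the /MAAS prefix. The function name `find_unprefixed_404s` and the regular expression are illustrative assumptions based only on the two log lines quoted above; they are not part of MAAS or cloud-init.

```python
import re

# Assumed pattern, derived from the regiond.log lines quoted in this report.
REQUEST_RE = re.compile(
    r"regiond: \[info\] (?P<client>\S+) GET (?P<path>\S+) HTTP/1\.1 --> (?P<status>\d{3})"
)


def find_unprefixed_404s(lines, prefix="/MAAS/"):
    """Return (client, path) pairs for 404 metadata requests lacking the prefix."""
    hits = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a request line we recognise
        path = match.group("path")
        if (
            match.group("status") == "404"
            and "/metadata/" in path
            and not path.startswith(prefix)
        ):
            hits.append((match.group("client"), path))
    return hits


# The failing and the successful request from the comment above.
sample = [
    "2018-03-15 11:28:33 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/cwerrd/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)",
    "2018-03-15 11:29:04 regiond: [info] 10.244.40.32 GET /MAAS/metadata/latest/by-id/w3hg84/?op=get_preseed HTTP/1.1 --> 200 OK (referrer: -; agent: Cloud-Init/17.2)",
]
hits = find_unprefixed_404s(sample)
print(hits)
```

Only the first sample line is reported; the second has both the prefix and a 200 status. To scan a real file, pass an open file object as `lines`.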

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-19T20:48:37.861146+00:00

Infra logs from the failure in the last comment.

Launchpad attachments: infra-logs.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Gregan(cgregan) wrote on 2018-03-26T19:47:39.302779+00:00

Upgraded to field high given the number of times we have seen this issue

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-03-26T19:49:30.248545+00:00

Marking this as incomplete in MAAS as Jason was to provide more logs when the next occurrence of this happens.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-26T20:02:04.862784+00:00

I'm sorry, which logs was I supposed to supply?

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-26T20:25:48+00:00

Ahh right, we turned on an extra maaslog around the kernel parameters.

** Changed in: maas Status: Incomplete => New

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-28T16:03:32.128179+00:00

I captured the requested log information, and it shows maas is sending the incorrect URL to the node in the kernel parameters:

10.244.40.30/var/log/maas/maas.log:Mar 28 11:59:18 leafeon maas.kernel_opts: message repeated 5 times: [ [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"]

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-28T17:04:34.662835+00:00

Launchpad attachments: infra-logs (1).tar

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-03-28T18:32:29.454159+00:00

From the machine running region/rack 10.244.40.30, we see the following:

  1. /etc/maas/rackd.conf:maas_url: http://10.244.40.33/MAAS
  2. /etc/maas/regiond.conf:maas_url: http://10.244.40.33/MAAS

The logs show the problem for only one machine. MAAS correctly tells that machine to download the squashfs from the rack controller and gives it the correct IP for the metadata URL, but drops the /MAAS prefix for some reason, which makes the URL invalid.

Mar 28 11:59:18 leafeon maas.kernel_opts: [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"

On the other hand, quite a few other log entries show the correct URL, with the same location for the images and the correct URL for the metadata.

Mar 28 12:02:50 leafeon maas.kernel_opts: [info] ---: kernel parameters maas-enlist "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/ga-16.04/xenial/daily/squashfs ro ip=::::maas-enlist:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed apparmor=0 log_host=10.244.40.33 log_port=514"

So somewhere in the code the /MAAS is being removed and we need to find out why.
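The comparison made above can be sketched as a small check: pull `cloud-config-url` out of a logged kernel command line and test whether its path sits under the path of the configured `maas_url`. The helper names `cloud_config_url` and `has_maas_prefix` are hypothetical, and the kernel parameter strings are shortened versions of the two log entries quoted above.

```python
import re
from urllib.parse import urlsplit


def cloud_config_url(kernel_params):
    """Extract the cloud-config-url value from a kernel command line, if any."""
    match = re.search(r"cloud-config-url=(\S+)", kernel_params)
    return match.group(1) if match else None


def has_maas_prefix(url, maas_url):
    """True if the URL's path is under the path component of maas_url."""
    if url is None:
        return False
    expected = urlsplit(maas_url).path.rstrip("/") + "/"
    return urlsplit(url).path.startswith(expected)


# Value from /etc/maas/rackd.conf and /etc/maas/regiond.conf as quoted above.
MAAS_URL = "http://10.244.40.33/MAAS"

# Shortened kernel command lines; only the relevant parameters are kept.
failing = (
    "nomodeset ro ip6=off "
    "cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed "
    "apparmor=0 log_host=10.244.40.33 log_port=514"
)
working = (
    "nomodeset ro ip6=off "
    "cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed "
    "apparmor=0 log_host=10.244.40.33 log_port=514"
)

failing_ok = has_maas_prefix(cloud_config_url(failing), MAAS_URL)
working_ok = has_maas_prefix(cloud_config_url(working), MAAS_URL)
print(failing_ok, working_ok)
```

This only detects the symptom in the logged parameters; it says nothing about where in MAAS the prefix is lost.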

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-10T21:54:37.684744+00:00

In this case it doesn't include the port:

cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Gregan(cgregan) wrote on 2018-04-12T10:06:45.725609+00:00

This bug requires a task for 2.3.x as well. SLAs do not apply to unreleased versions and besides that, this issue cannot remain broken in xenial. Please add task

ubuntu-server-builder commented 1 year ago

Launchpad user John George(jog) wrote on 2018-04-12T10:26:12.086526+00:00

Attaching logs from the latest reproduction of this bug.

Launchpad attachments: infra-logs_da37941a-0cdd-4004-a761-481247ff8bff.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-12T14:47:23.505032+00:00

I'm marking this as incomplete for 2.3 provided that the latest logs show the failure on 2.3.1. Since 2.3.2 is already released, I'd like to see if 2.3.2 has improved this.

2018-04-08 10:20:28 provisioningserver.rpc.clusterservice: [info] Rack controller 'fcffxf' registered (via swoobat:pid=25312) with MAAS version 2.3.1-6470-g036d646-0ubuntu1~16.04.1.

Once we have 2.3.2 runs with this issue, please mark it as 'New'.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-04-17T22:10:40.464717+00:00

We hit this again on 2.3.2:

10.244.40.32/var/log/maas/regiond.log:2018-04-12 10:39:08 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/swywya/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)

Launchpad attachments: infra-logs.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-19T18:26:22.176471+00:00

Hi guys,

As already discussed, Blake needs the DB dump to debug this issue further. Since we don't yet have a test run that includes the DB dump, I'll mark this as incomplete.

Please set as 'New' once we have logs with this so we can notice the status change.

Thanks!

ubuntu-server-builder commented 1 year ago

Launchpad user John George(jog) wrote on 2018-04-23T19:42:51.723420+00:00

Logs which include a database dump in the file named dump.dmp.

Launchpad attachments: infra-logs.tar

ubuntu-server-builder commented 1 year ago

Launchpad user Christian Reis(kiko) wrote on 2018-04-23T22:59:23.266894+00:00

And bump to NEW

ubuntu-server-builder commented 1 year ago

Launchpad user Blake Rouse(blake-rouse) wrote on 2018-04-24T14:45:47.823945+00:00

 system_id | hostname | node_type | url
-----------+----------+-----------+--------------------------
 byskdg    | leafeon  | 4         | http://10.244.40.33/MAAS
 wdmyam    | swoobat  | 4         | http://10.244.40.33/MAAS
 dftght    | meinfoo  | 4         | http://10.244.40.33/MAAS

This shows they all have the correct URL, but I do wonder about the missing '/' at the end. I believe urljoin without a '/' at the end of the base will replace the /MAAS.

Just providing updates, still looking into the issue.

ubuntu-server-builder commented 1 year ago

Launchpad user Blake Rouse(blake-rouse) wrote on 2018-04-24T15:13:04.596508+00:00

That's what I thought:

    >>> from urllib.parse import urljoin
    >>> urljoin('http://localhost:5240/MAAS', '/metadata/latest/by-id/swywya/?op=get_preseed')
    'http://localhost:5240/metadata/latest/by-id/swywya/?op=get_preseed'
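To make the urljoin behaviour easier to see, here is a small sketch that tries all four combinations of base (with and without a trailing '/') and path (with and without a leading '/'). `urljoin` follows RFC 3986 reference resolution, so a path that starts with '/' replaces the base path whether or not the base ends in '/'; the trailing slash on the base only matters for relative paths. The helper `join_under_base` is hypothetical and is not the fix that landed in MAAS; it is just one way to keep the base path.

```python
from urllib.parse import urljoin

ABS_PATH = "/metadata/latest/by-id/swywya/?op=get_preseed"
REL_PATH = "metadata/latest/by-id/swywya/?op=get_preseed"

results = {
    # A leading '/' on the path discards the base path in both cases.
    "no_slash_abs": urljoin("http://localhost:5240/MAAS", ABS_PATH),
    "slash_abs": urljoin("http://localhost:5240/MAAS/", ABS_PATH),
    # Relative path: the last base segment is replaced unless the base ends in '/'.
    "no_slash_rel": urljoin("http://localhost:5240/MAAS", REL_PATH),
    "slash_rel": urljoin("http://localhost:5240/MAAS/", REL_PATH),
}


def join_under_base(base, path):
    """Hypothetical helper: join so that the base path is always preserved."""
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


for name, url in results.items():
    print(name, url)
print(join_under_base("http://localhost:5240/MAAS", ABS_PATH))
```

Only the combination of a base ending in '/' and a path without a leading '/' keeps /MAAS in the result, which is what `join_under_base` normalises to.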

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-04-25T02:27:28.905694+00:00

Given that theory, I'm confused how, on the same deployment of MAAS, it would work with most nodes but not with others.

ubuntu-server-builder commented 1 year ago

Launchpad user Chris Gregan(cgregan) wrote on 2018-05-08T20:42:34.572537+00:00

Field High SLA now requires that an estimated date for a fix be listed in the comments. Please provide this estimate for the open tasks.

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2018-09-12T17:42:49.087217+00:00

To provide an update: we have looked at backporting these fixes to 2.3, and they are not a straightforward backport. There are a few conflicts, and the fixes are built on top of changes that are only present in 2.4+ and not in 2.3, which makes this task much more complicated.

As such, these changes may not be backportable at all. We'll keep you posted.

ubuntu-server-builder commented 1 year ago

Launchpad user Dean Henrichsmeyer(dean) wrote on 2018-09-12T21:49:44.841116+00:00

For some added clarification here: this doesn't backport cleanly to 2.3.x. The fix was built on machinery in 2.4 that does not exist in 2.3, which makes the backport more expensive and increases the risk of regressions on 2.3. Given that complexity and the fact that PODs were not officially supported until 2.4, we've decided not to backport this. Upgrading to 2.4 is the best path forward.