Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-01-12T15:54:49.872103+00:00
Launchpad attachments: infra-logs.tar
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-01-12T16:29:07.084364+00:00
@Jason,
As per our IRC chat, although no console logs are stored on the host, you can still obtain them by re-commissioning:
Capture the kernel params being sent; you will see them on the VM's KVM console. (Try re-commissioning to obtain these.)
Enable console logging via kernel params on the specific VM, and then re-commission.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-01-17T14:15:53.679904+00:00
I hit this again today. This is a race condition - it isn't reproducible by re-commissioning the failing VM. I did that, and commissioning succeeded that time. I still grabbed the kernel command line, but I don't think it's very useful. http://paste.ubuntu.com/26404667/
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-19T20:34:26.357410+00:00
We reproduced this some last week, and I noticed that we're getting 404s when the failing node tries to retrieve its preseed:
10.244.40.31/var/log/maas/regiond.log:2018-03-15 11:28:33 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/cwerrd/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)
The problem there is that the /MAAS prefix is missing from the path. Here's a successful request from another VM:
10.244.40.32/var/log/maas/regiond.log:2018-03-15 11:29:04 regiond: [info] 10.244.40.32 GET /MAAS/metadata/latest/by-id/w3hg84/?op=get_preseed HTTP/1.1 --> 200 OK (referrer: -; agent: Cloud-Init/17.2)
Notice the /MAAS in the URL on the successful one.
Now, the question is why the /MAAS is missing in the failure case.
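For anyone checking their own logs for this, here is a minimal sketch that scans regiond.log lines for metadata requests that returned 404 and lack the /MAAS prefix. The regular expression is inferred only from the two log lines quoted above, and the function name is hypothetical; it is not part of MAAS.

```python
import re

# Pattern inferred from the two regiond.log lines quoted in this comment
# (an assumption; other MAAS versions may log in a different format).
REQUEST_RE = re.compile(
    r"regiond: \[info\] (?P<client>\S+) GET (?P<path>\S+) HTTP/1\.1 --> (?P<status>\d{3})"
)


def find_unprefixed_metadata_404s(lines, prefix="/MAAS"):
    """Return (client, path) pairs for 404s on metadata paths missing the prefix."""
    hits = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a request line
        path = match.group("path")
        if (
            match.group("status") == "404"
            and "/metadata/" in path
            and not path.startswith(prefix + "/")
        ):
            hits.append((match.group("client"), path))
    return hits


# The failing and the successful request from this comment.
SAMPLE_LINES = [
    "10.244.40.31/var/log/maas/regiond.log:2018-03-15 11:28:33 regiond: [info] "
    "10.244.40.32 GET /metadata/latest/by-id/cwerrd/?op=get_preseed HTTP/1.1 "
    "--> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)",
    "10.244.40.32/var/log/maas/regiond.log:2018-03-15 11:29:04 regiond: [info] "
    "10.244.40.32 GET /MAAS/metadata/latest/by-id/w3hg84/?op=get_preseed HTTP/1.1 "
    "--> 200 OK (referrer: -; agent: Cloud-Init/17.2)",
]

for client, path in find_unprefixed_metadata_404s(SAMPLE_LINES):
    print(client, path)
```

Only the first sample line is reported, since the second request carries the /MAAS prefix and succeeded.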
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-19T20:48:37.861146+00:00
infra logs from the failure in the last comment. Launchpad attachments: infra-logs.tar
Launchpad user Chris Gregan(cgregan) wrote on 2018-03-26T19:47:39.302779+00:00
Upgraded to field high given the number of times we have seen this issue
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-03-26T19:49:30.248545+00:00
Marking this as incomplete in MAAS as Jason was to provide more logs when the next occurrence of this happens.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-26T20:02:04.862784+00:00
I'm sorry, which logs was I supposed to supply?
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-26T20:25:48+00:00
Ahh right, we turned on an extra maaslog around the kernel parameters.
** Changed in: maas Status: Incomplete => New
Title: [b1] pod created vm fails commissioning after getting 404 from metadata api
Status in cloud-init: New Status in MAAS: New
Bug description: A vm created via the pods API in maas failed to commission immediately after it was created.
It PXE booted, got initrd and the kernel, dhcp'd again, and then wasn't heard from anymore:
http://paste.ubuntu.com/26372429/
There are no rsyslog logs for it. The hostname of the vm is landscapeamqp-1.maas.
If I connect to the node with virt-viewer, it is sitting at an Ubuntu prompt, but I can't log in because there are no passwords set. There are no console logs available on disk (bug 1742971).
This is with maas 2.3.0 (6434-gd354690-0ubuntu1~16.04.1) in an HA setup - logs from the MAAS servers are available in the attached infra-logs.tar.
To check for this bug you can look for 404 errors like this: http://paste.ubuntu.com/p/hsbw22BFps/
This was using the default daily maas images.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-28T16:03:32.128179+00:00
I captured the requested log information, and it shows maas is sending the incorrect URL to the node in the kernel parameters:
10.244.40.30/var/log/maas/maas.log:Mar 28 11:59:18 leafeon maas.kernel_opts: message repeated 5 times: [ [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"]
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-03-28T17:04:34.662835+00:00
Launchpad attachments: infra-logs (1).tar
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-03-28T18:32:29.454159+00:00
From the machine running region/rack 10.244.40.30, we see the following:
The logs show this for only one machine. MAAS correctly tells that machine to download the squashfs from the rack controller and gives it the correct IP for the metadata URL, but for some reason it drops the /MAAS prefix, which makes the URL invalid.
Mar 28 11:59:18 leafeon maas.kernel_opts: [info] ---: kernel parameters landscapeha-1 "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/generic/xenial/daily/squashfs ro ip=::::landscapeha-1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed apparmor=0 log_host=10.244.40.33 log_port=514"
On the other hand, quite a few other log entries show the correct URL, with the same location for the images and the correct URL for the metadata.
Mar 28 12:02:50 leafeon maas.kernel_opts: [info] ---: kernel parameters maas-enlist "nomodeset root=squash:http://10.244.40.30:5248/images/ubuntu/amd64/ga-16.04/xenial/daily/squashfs ro ip=::::maas-enlist:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed apparmor=0 log_host=10.244.40.33 log_port=514"
So somewhere in the code the /MAAS is being removed and we need to find out why.
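To compare entries like the two above without reading them by eye, here is a rough sketch that pulls cloud-config-url out of a logged kernel command line and checks whether its path starts with /MAAS/. The helper names are hypothetical, and the parameter strings are shortened excerpts of the log lines above.

```python
from urllib.parse import urlsplit


def cloud_config_url(kernel_params):
    """Return the value of cloud-config-url= from a kernel command line, or None."""
    for token in kernel_params.split():
        if token.startswith("cloud-config-url="):
            # Split once only: the URL's query string also contains '='.
            return token.split("=", 1)[1]
    return None


def has_maas_prefix(url, prefix="/MAAS"):
    """True if the URL path sits under the given prefix."""
    return urlsplit(url).path.startswith(prefix + "/")


# Shortened excerpts of the two kernel_opts entries above.
BAD = (
    "nomodeset ro ip=::::landscapeha-1:BOOTIF ip6=off "
    "cloud-config-url=http://10.244.40.33/metadata/latest/by-id/b4hrca/?op=get_preseed "
    "apparmor=0 log_host=10.244.40.33 log_port=514"
)
GOOD = (
    "nomodeset ro ip=::::maas-enlist:BOOTIF ip6=off "
    "cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed "
    "apparmor=0 log_host=10.244.40.33 log_port=514"
)

for params in (BAD, GOOD):
    url = cloud_config_url(params)
    print(url, "ok" if has_maas_prefix(url) else "missing /MAAS")
```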
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-10T21:54:37.684744+00:00
In this case it doesn't include the port:
cloud-config-url=http://10.244.40.33/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed
Launchpad user Chris Gregan(cgregan) wrote on 2018-04-12T10:06:45.725609+00:00
This bug requires a task for 2.3.x as well. SLAs do not apply to unreleased versions and, besides that, this issue cannot remain broken in xenial. Please add a task.
Launchpad user John George(jog) wrote on 2018-04-12T10:26:12.086526+00:00
Attaching logs from the latest reproduction of this bug. Launchpad attachments: infra-logs_da37941a-0cdd-4004-a761-481247ff8bff.tar
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-12T14:47:23.505032+00:00
I'm marking this as incomplete for 2.3, given that the latest logs show the failure on 2.3.1. Since 2.3.2 is already released, I'd like to see whether 2.3.2 has improved this.
2018-04-08 10:20:28 provisioningserver.rpc.clusterservice: [info] Rack controller 'fcffxf' registered (via swoobat:pid=25312) with MAAS version 2.3.1-6470-g036d646-0ubuntu1~16.04.1.
Once we have 2.3.2 runs with this issue, please mark it as 'New'.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-04-17T22:10:40.464717+00:00
We hit this again on 2.3.2:
10.244.40.32/var/log/maas/regiond.log:2018-04-12 10:39:08 regiond: [info] 10.244.40.32 GET /metadata/latest/by-id/swywya/?op=get_preseed HTTP/1.1 --> 404 NOT_FOUND (referrer: -; agent: Cloud-Init/17.2)
Launchpad attachments: infra-logs.tar
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-04-19T18:26:22.176471+00:00
Hi guys,
As already discussed, Blake needs the DB dump to debug this issue further. Since we don't yet have a test run that includes the DB dump, I'll mark this as incomplete.
Please set as 'New' once we have logs with this so we can notice the status change.
Thanks!
Launchpad user John George(jog) wrote on 2018-04-23T19:42:51.723420+00:00
Logs which include a database dump in the file named dump.dmp Launchpad attachments: infra-logs.tar
Launchpad user Christian Reis(kiko) wrote on 2018-04-23T22:59:23.266894+00:00
And bump to NEW
Launchpad user Blake Rouse(blake-rouse) wrote on 2018-04-24T14:45:47.823945+00:00
 system_id | hostname | node_type | url
-----------+----------+-----------+--------------------------
 byskdg    | leafeon  |         4 | http://10.244.40.33/MAAS
 wdmyam    | swoobat  |         4 | http://10.244.40.33/MAAS
 dftght    | meinfoo  |         4 | http://10.244.40.33/MAAS
This shows they all have the correct URL. I do wonder about the missing '/' at the end, though: I believe urljoin on a base without a trailing '/' will replace the /MAAS.
Just providing updates, still looking into the issue.
Launchpad user Blake Rouse(blake-rouse) wrote on 2018-04-24T15:13:04.596508+00:00
That's what I thought:
>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:5240/MAAS', '/metadata/latest/by-id/swywya/?op=get_preseed')
'http://localhost:5240/metadata/latest/by-id/swywya/?op=get_preseed'
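To spell out the urljoin behaviour in more detail, the sketch below tries the combinations of trailing slash on the base and leading slash on the path, using the region URL from the table. Note that a path with a leading '/' replaces /MAAS whether or not the base ends in '/'. The join_under_base helper is hypothetical and only illustrates one way to keep the prefix; it is not the fix that went into MAAS.

```python
from urllib.parse import urljoin

BASE = "http://10.244.40.33/MAAS"
PATH = "metadata/latest/by-id/swywya/?op=get_preseed"

# Leading '/' on the path: the base path is replaced, even with a trailing '/'.
absolute_ref = urljoin(BASE + "/", "/" + PATH)
# No trailing '/' on the base: the last segment (MAAS) is replaced.
no_slash = urljoin(BASE, PATH)
# Trailing '/' on the base and a relative path: /MAAS is kept.
both = urljoin(BASE + "/", PATH)


def join_under_base(base, path):
    """Hypothetical helper: always resolve path underneath base."""
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


print(absolute_ref)  # http://10.244.40.33/metadata/latest/by-id/swywya/?op=get_preseed
print(no_slash)      # http://10.244.40.33/metadata/latest/by-id/swywya/?op=get_preseed
print(both)          # http://10.244.40.33/MAAS/metadata/latest/by-id/swywya/?op=get_preseed
print(join_under_base(BASE, "/" + PATH))  # same as `both`
```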
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2018-04-25T02:27:28.905694+00:00
Given that theory, I'm confused how, on the same deployment of MAAS, it would work with most nodes but not with others.
Launchpad user Chris Gregan(cgregan) wrote on 2018-05-08T20:42:34.572537+00:00
Field High SLA now requires that an estimated date for a fix be listed in the comments. Please provide this estimate for the open tasks.
Launchpad user Andres Rodriguez(andreserl) wrote on 2018-09-12T17:42:49.087217+00:00
To provide an update: we have looked at backporting these fixes to 2.3, and they are not a straightforward backport. There are a few conflicts, but more importantly these fixes are built on top of changes that are present only in 2.4+ and not in 2.3, which makes this task much more complicated.
As such, these changes may not be backportable at all. We'll keep you posted.
Launchpad user Dean Henrichsmeyer(dean) wrote on 2018-09-12T21:49:44.841116+00:00
For some added clarification here: this doesn't backport cleanly to 2.3.x. The fix was built on machinery in 2.4 that does not exist in 2.3, which makes the backport more expensive and increases the risk of regressions on 2.3. Given that complexity and the fact that pods were not officially supported until 2.4, we've decided not to backport this. Upgrading to 2.4 is the best path forward.
This bug was originally filed in Launchpad as LP: #1742971