Closed ubuntu-server-builder closed 1 year ago
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T15:49:25.101966+00:00
Launchpad attachments: logs-2019-05-01-14.17.40 (1).tar
Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-01T16:34:38.302174+00:00
So this is what I see on the logs:
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7 2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7 2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
So, from all the information above, I don't think we have enough information to know what the issue is.
A. The machine was never instructed to localboot. B. The machine was instructed to localboot, but grub failed. C. The machine booted onto the disk, but either didn't get network or failed to contact metadata. D. There is a firmware issue preventing the machine from accessing the deployed environment.
Looking at the looks, it seems that:
A -> the machine did indeed reboot and accessed the grub config and instructed to localboot. B -> We don't know if grub failed, because we have no console logs. C -> Could be the case that network was not configured properly on reboot either due to cloud-init or a bug in netplan. For this we need console logs. D -> WE need console logs.
So from all the info here, I'm marking this bug as incomplete as we would really need console logs to determine what's the issues. That said, these could also be curtin issues when, while it succeeded, it could have caused have misconfigured something for which the machine never really boot into the installed environment. So, I'm adding curtin to see if they can help us.
@Jason, quick q, are the machines that failed to boot all grub? it seems to me that's the case but just want to double check.
Lastly, we would really need console logs. @Jason, you can setup conserver to automatically gather the logs from the console and share those with MAAS.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T16:43:18+00:00
On Wed, May 1, 2019 at 11:40 AM Andres Rodriguez andreserl@ubuntu-pe.org wrote:
So this is what I see on the logs:
- On rackd.log on .32, I see the machine PXE boot to start the deployment process:
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7 2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7 2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7
- On rackd.log on .30, I see it pxe boot post-deployment (and its told to localboot):
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
- I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/
So, from all the information above, I don't think we have enough information to know what the issue is.
A. The machine was never instructed to localboot. B. The machine was instructed to localboot, but grub failed. C. The machine booted onto the disk, but either didn't get network or failed to contact metadata. D. There is a firmware issue preventing the machine from accessing the deployed environment.
Looking at the looks, it seems that:
A -> the machine did indeed reboot and accessed the grub config and instructed to localboot. B -> We don't know if grub failed, because we have no console logs. C -> Could be the case that network was not configured properly on reboot either due to cloud-init or a bug in netplan. For this we need console logs. D -> WE need console logs.
So from all the info here, I'm marking this bug as incomplete as we would really need console logs to determine what's the issues. That said, these could also be curtin issues when, while it succeeded, it could have caused have misconfigured something for which the machine never really boot into the installed environment. So, I'm adding curtin to see if they can help us.
@Jason, quick q, are the machines that failed to boot all grub? it seems to me that's the case but just want to double check.
I'm not sure what you mean by this - they are all UEFI machines. They should be using EFI, not using grub from an MBR.
Launchpad user Ryan Harper(raharper) wrote on 2019-05-01T17:45:53.958501+00:00
I don't see any errors during this install for curtin. It installs grub2 uefi, adds the entry (0017), reorders back to boot from network (0016).
Please reopen the curtin task if you believe there's a curtin error during the installation.
Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-02T19:48:14.627946+00:00
So there are no apparent issues from the MAAS perspective, we really need to see console logs to be able to determine what's the issue. Keeping this as incomplete for MAAS until console logs can be provided.
PS. You can add kernel params to the machine to log to console, and setup conserver-server to remotely gather the information automatically without manual intervention (once conserver-server has been setup of course).
Launchpad user Blake Rouse(blake-rouse) wrote on 2019-05-02T20:28:04.760962+00:00
Do you have a tcpdump of the network traffic? That is needed to be able to determine what the issue is. We need to inspect the HTTP headers that is being sent from nginx to the client. Specifically the "Content-Length" header to see if there is a mismatch there.
It is possible that the correct Content-Length is being sent to the client, but the client is either closing the HTTP connection to soon or a TCP reset is occurring. That will still report as a 200 response on the server side as the response was 200. The difference will be if the actual amount of data sent to the client matches the Content-Length of the response HTTP header.
Knowing if those mismatch will be the first step, because there is a few outcomes:
Also the output of the whole tree structure of /var/lib/maas/boot-resources and byte sizes (in bytes) so the Content-Length can be compared to the actual file.
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-03T15:06:10.309249+00:00
I'm struggling to find the right parameters to get serial console. I did notice we get this error when booting after the install
Booting local disk... Failed to open \efi\boot\grubx64.efi - Not Found Failed to load image \efi\boot\grubx64.efi: Not Found start_image() returned Not Found
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-03T15:41:12.217127+00:00
console output: http://paste.ubuntu.com/p/HyCmWrHYqw/
[ 55.404228] cloud-init[1039]: self._handle_bond_bridge(command, cmd_type='bond')
[ 55.416254] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 690, in _handle_bond_bridge
[ 55.428150] cloud-init[1039]: item_params.get('parameters', {}).items())
[ 55.440297] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 689, in
Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-03T15:52:55.963620+00:00
MAAS sends this snippet:
network: bonds: bond0: interfaces:
Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-03T16:25:09.453106+00:00
To document what the issue is, MAAS sends the key 'gratuitous-arp', following netplan, however cloud-init misspells this key as 'gratuitious-arp'. MAAS will workaround this issue until cloud-init fixes it.
Launchpad user Ryan Harper(raharper) wrote on 2019-05-03T17:01:22.168500+00:00
cloud-init included the only (mis)spelling of gratuitous-arp, per
https://bugs.launchpad.net/netplan/+bug/1756701
This is now fixed in netplan and released. Cloud-init needs to accept both values now.
Launchpad user Ryan Harper(raharper) wrote on 2019-05-04T00:34:00.969762+00:00
The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.
Launchpad user Steve Langasek(vorlon) wrote on 2019-05-04T00:55:34+00:00
On Sat, May 04, 2019 at 12:34:00AM -0000, Ryan Harper wrote:
The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.
In the original netplan, it's also misspelled in the output to networkd, so probably never worked? From xenial, and bionic release pocket:
./src/networkd.c: g_string_append_printf(params, "\nGratuitiousARP=%d", def->bond_params.gratuitious_arp);
In the NM backend that's spelled 'num_grat_arp' in the output so should work, but that's also largely irrelevant for cloud-init I would think.
Launchpad user Server Team CI bot(server-team-bot) wrote on 2019-05-29T04:59:46.785687+00:00
This bug is fixed with commit ded1ec81 to cloud-init on branch master. To view that commit see the following URL: https://git.launchpad.net/cloud-init/commit/?id=ded1ec81
Launchpad user Ryan Harper(raharper) wrote on 2019-07-17T17:12:22.953666+00:00
This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New
Thank you.
This bug was originally filed in Launchpad as LP: #1827238
Launchpad details
Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T15:46:56.986977+00:00
Many nodes failed to boot after installation.
Here is one example, beartic.
beartic. finishes install: 2019-05-01T10:36:17+00:00 beartic systemd[1]: Stopping Unattended Upgrades Shutdown...
dhcp's after reboot: 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPDISCOVER from 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPOFFER on 10.244.41.7 to 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPREQUEST for 10.244.41.7 (10.244.40.30) from 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPACK on 10.244.41.7 to 14:02:ec:41:c7:dc via broam
grub and grub.cfg: 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
but we never got any rsyslog message or api calls after that.