canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.87k stars 856 forks source link

Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp #3377

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1827238

Launchpad details
affected_projects = ['curtin', 'maas']
assignee = raharper
assignee_name = Ryan Harper
date_closed = 2019-07-17T17:12:20.937007+00:00
date_created = 2019-05-01T15:46:56.986977+00:00
date_fix_committed = 2019-05-29T04:59:48.471386+00:00
date_fix_released = 2019-07-17T17:12:20.937007+00:00
id = 1827238
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1827238
milestone = None
owner = andreserl
owner_name = Andres Rodriguez
private = False
status = fix_released
submitter = jason-hobbs
submitter_name = Jason Hobbs
tags = ['cdo-qa', 'cdo-release-blocker', 'foundations-engine']
duplicates = []

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T15:46:56.986977+00:00

Many nodes failed to boot after installation.

Here is one example, beartic.

beartic. finishes install: 2019-05-01T10:36:17+00:00 beartic systemd[1]: Stopping Unattended Upgrades Shutdown...

dhcp's after reboot: 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPDISCOVER from 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPOFFER on 10.244.41.7 to 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPREQUEST for 10.244.41.7 (10.244.40.30) from 14:02:ec:41:c7:dc via broam 10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPACK on 10.244.41.7 to 14:02:ec:41:c7:dc via broam

grub and grub.cfg: 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

but we never got any rsyslog message or api calls after that.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T15:49:25.101966+00:00

Launchpad attachments: logs-2019-05-01-14.17.40 (1).tar

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-01T16:34:38.302174+00:00

So this is what I see on the logs:

  1. On rackd.log on .32, I see the machine PXE boot to start the deployment process:

2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7 2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7 2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7

  1. On rackd.log on .30, I see it pxe boot post-deployment (and its told to localboot):

2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

  1. I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/

So, from all the information above, I don't think we have enough information to know what the issue is.

A. The machine was never instructed to localboot. B. The machine was instructed to localboot, but grub failed. C. The machine booted onto the disk, but either didn't get network or failed to contact metadata. D. There is a firmware issue preventing the machine from accessing the deployed environment.

Looking at the looks, it seems that:

A -> the machine did indeed reboot and accessed the grub config and instructed to localboot. B -> We don't know if grub failed, because we have no console logs. C -> Could be the case that network was not configured properly on reboot either due to cloud-init or a bug in netplan. For this we need console logs. D -> WE need console logs.

So from all the info here, I'm marking this bug as incomplete as we would really need console logs to determine what's the issues. That said, these could also be curtin issues when, while it succeeded, it could have caused have misconfigured something for which the machine never really boot into the installed environment. So, I'm adding curtin to see if they can help us.

@Jason, quick q, are the machines that failed to boot all grub? it seems to me that's the case but just want to double check.

Lastly, we would really need console logs. @Jason, you can setup conserver to automatically gather the logs from the console and share those with MAAS.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-01T16:43:18+00:00

On Wed, May 1, 2019 at 11:40 AM Andres Rodriguez andreserl@ubuntu-pe.org wrote:

So this is what I see on the logs:

  1. On rackd.log on .32, I see the machine PXE boot to start the deployment process:

2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7 2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7 2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7 2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7

  1. On rackd.log on .30, I see it pxe boot post-deployment (and its told to localboot):

2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7 2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

  1. I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/

So, from all the information above, I don't think we have enough information to know what the issue is.

A. The machine was never instructed to localboot. B. The machine was instructed to localboot, but grub failed. C. The machine booted onto the disk, but either didn't get network or failed to contact metadata. D. There is a firmware issue preventing the machine from accessing the deployed environment.

Looking at the looks, it seems that:

A -> the machine did indeed reboot and accessed the grub config and instructed to localboot. B -> We don't know if grub failed, because we have no console logs. C -> Could be the case that network was not configured properly on reboot either due to cloud-init or a bug in netplan. For this we need console logs. D -> WE need console logs.

So from all the info here, I'm marking this bug as incomplete as we would really need console logs to determine what's the issues. That said, these could also be curtin issues when, while it succeeded, it could have caused have misconfigured something for which the machine never really boot into the installed environment. So, I'm adding curtin to see if they can help us.

@Jason, quick q, are the machines that failed to boot all grub? it seems to me that's the case but just want to double check.

I'm not sure what you mean by this - they are all UEFI machines. They should be using EFI, not using grub from an MBR.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-05-01T17:45:53.958501+00:00

I don't see any errors during this install for curtin. It installs grub2 uefi, adds the entry (0017), reorders back to boot from network (0016).

Please reopen the curtin task if you believe there's a curtin error during the installation.

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-02T19:48:14.627946+00:00

So there are no apparent issues from the MAAS perspective, we really need to see console logs to be able to determine what's the issue. Keeping this as incomplete for MAAS until console logs can be provided.

PS. You can add kernel params to the machine to log to console, and setup conserver-server to remotely gather the information automatically without manual intervention (once conserver-server has been setup of course).

ubuntu-server-builder commented 1 year ago

Launchpad user Blake Rouse(blake-rouse) wrote on 2019-05-02T20:28:04.760962+00:00

Do you have a tcpdump of the network traffic? That is needed to be able to determine what the issue is. We need to inspect the HTTP headers that is being sent from nginx to the client. Specifically the "Content-Length" header to see if there is a mismatch there.

It is possible that the correct Content-Length is being sent to the client, but the client is either closing the HTTP connection to soon or a TCP reset is occurring. That will still report as a 200 response on the server side as the response was 200. The difference will be if the actual amount of data sent to the client matches the Content-Length of the response HTTP header.

Knowing if those mismatch will be the first step, because there is a few outcomes:

  1. The server is reading the file wrong and setting the Content-Length not to the actual file length.
  2. The server is sending the file and something occurs for the file handler to be closed so the streaming action is stopped (I would expect a log in nginx errors if this were to occur, but maybe not)
  3. The client is closing the connection before reading all the data being streamed from the server.
  4. A TCP reset is occurring break the TCP connection, giving the same result of #3 but caused by TCP reset instead of client disconnect.

Also the output of the whole tree structure of /var/lib/maas/boot-resources and byte sizes (in bytes) so the Content-Length can be compared to the actual file.

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-03T15:06:10.309249+00:00

I'm struggling to find the right parameters to get serial console. I did notice we get this error when booting after the install

Booting local disk... Failed to open \efi\boot\grubx64.efi - Not Found Failed to load image \efi\boot\grubx64.efi: Not Found start_image() returned Not Found

ubuntu-server-builder commented 1 year ago

Launchpad user Jason Hobbs(jason-hobbs) wrote on 2019-05-03T15:41:12.217127+00:00

console output: http://paste.ubuntu.com/p/HyCmWrHYqw/

[ 55.404228] cloud-init[1039]: self._handle_bond_bridge(command, cmd_type='bond') [ 55.416254] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 690, in _handle_bond_bridge [ 55.428150] cloud-init[1039]: item_params.get('parameters', {}).items()) [ 55.440297] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 689, in [ 55.452161] cloud-init[1039]: 'params': dict((v2key_to_v1[k], v) for k, v in

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-03T15:52:55.963620+00:00

MAAS sends this snippet:

network: bonds: bond0: interfaces:

ubuntu-server-builder commented 1 year ago

Launchpad user Andres Rodriguez(andreserl) wrote on 2019-05-03T16:25:09.453106+00:00

To document what the issue is, MAAS sends the key 'gratuitous-arp', following netplan, however cloud-init misspells this key as 'gratuitious-arp'. MAAS will workaround this issue until cloud-init fixes it.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-05-03T17:01:22.168500+00:00

cloud-init included the only (mis)spelling of gratuitous-arp, per

https://bugs.launchpad.net/netplan/+bug/1756701

This is now fixed in netplan and released. Cloud-init needs to accept both values now.

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-05-04T00:34:00.969762+00:00

The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2019-05-04T00:55:34+00:00

On Sat, May 04, 2019 at 12:34:00AM -0000, Ryan Harper wrote:

The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.

In the original netplan, it's also misspelled in the output to networkd, so probably never worked? From xenial, and bionic release pocket:

./src/networkd.c: g_string_append_printf(params, "\nGratuitiousARP=%d", def->bond_params.gratuitious_arp);

In the NM backend that's spelled 'num_grat_arp' in the output so should work, but that's also largely irrelevant for cloud-init I would think.

ubuntu-server-builder commented 1 year ago

Launchpad user Server Team CI bot(server-team-bot) wrote on 2019-05-29T04:59:46.785687+00:00

This bug is fixed with commit ded1ec81 to cloud-init on branch master. To view that commit see the following URL: https://git.launchpad.net/cloud-init/commit/?id=ded1ec81

ubuntu-server-builder commented 1 year ago

Launchpad user Ryan Harper(raharper) wrote on 2019-07-17T17:12:22.953666+00:00

This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.