canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Cloud images fail to boot when a serial port is not available #2657

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1573095

Launchpad details
affected_projects = ['cloud-images', 'ubuntu', 'initramfs-tools (Ubuntu)']
assignee = None
assignee_name = None
date_closed = 2016-07-12T08:59:33.212399+00:00
date_created = 2016-04-21T15:14:42.118671+00:00
date_fix_committed = None
date_fix_released = None
id = 1573095
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1573095
milestone = None
owner = x-rbuntu-z
owner_name = zero
private = False
status = invalid
submitter = x-rbuntu-z
submitter_name = zero
tags = ['id-5b49154499e416396a3e983c', 'sts', 'xenial']
duplicates = []

Launchpad user zero(x-rbuntu-z) wrote on 2016-04-21T15:14:42.118671+00:00

I tried to launch an Ubuntu 16.04 cloud image within KVM. The image does not boot and hangs at

"Btrfs loaded"

Hypervisor env is Proxmox 4.1

[racb: see comment 40 for minimal steps to reproduce using Ubuntu-provided tooling only]

Related bugs:

ubuntu-server-builder commented 1 year ago

Launchpad user Nick Douma(lordgaav) wrote on 2016-04-24T21:23:40.853519+00:00

Can confirm this bug, attached is a screenshot. The VM will hang and have a CPU load of 100%, but the boot will never continue.

Launchpad attachments: xenial-boot-freeze.png

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2016-04-24T21:23:49.707680+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

ubuntu-server-builder commented 1 year ago

Launchpad user Kenneth Østrup(kennetho) wrote on 2016-04-25T09:08:03.724829+00:00

I am also seeing this issue, with the same results as screenshot submitted by Nick Douma.

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2016-04-25T10:49:37.167394+00:00

Hi zero, Kenneth, Nick,

Thanks for reporting and confirming this bug! Could one of you include a list of instructions to reliably reproduce this, please? That will make it much easier for someone investigating the bug to be sure that they are hitting the same issue that you are. :)

Thanks,

Dan

ubuntu-server-builder commented 1 year ago

Launchpad user zero(x-rbuntu-z) wrote on 2016-04-27T08:40:34.070412+00:00

Hello,

Here are the steps I followed to reproduce the bug on Proxmox 4.1:

  1. Download current cloud image (cow image version) at : https://uec-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
  2. Put image inside image storage location on proxmox
  3. Run qemu-img resize $image_path 20G (might be optional to reproduce the issue)
  4. Launch VM with the following command :

pvesh create /nodes/$hostname/qemu -name $hostname -bootdisk virtio0 -vmid $vmid -memory 1024 -sockets 1 -cores 1 -net0 virtio,bridge=vmbr0 -virtio0=local:$vmid/$image_path

  5. Start the created VM and display the console
  6. Boot will hang at "Btrfs loaded"

ubuntu-server-builder commented 1 year ago

Launchpad user Rodrigo Bahiense(rodbzro) wrote on 2016-04-29T15:40:04.958306+00:00

I'm also having this issue.

Tried with the .img and .vmdk distributions of "xenial-server-cloudimg-amd64-disk1".

Using VirtualBox 5.0.16r105871 on Windows 10 Pro x64 Build 10586

The boot freezes at the same point demonstrated in the #1 comment screenshot: https://bugs.launchpad.net/cloud-images/+bug/1573095/+attachment/4645921/+files/xenial-boot-freeze.png

ubuntu-server-builder commented 1 year ago

Launchpad user John Petrini(john-d-petrini) wrote on 2016-04-30T22:47:03.290509+00:00

I'm experiencing this bug also. Running KVM on a 16.04 host. Hangs at Btrfs loaded.

ubuntu-server-builder commented 1 year ago

Launchpad user John Petrini(john-d-petrini) wrote on 2016-04-30T22:54:30.475002+00:00

I should add that the cloud image does work in our OpenStack environment which is running KVM on 14.04 qemu-kvm version 1:2.5+dfsg-5ubuntu10. It does not work on 16.04 with qemu-kvm version 1:2.5+dfsg-5ubuntu10.

ubuntu-server-builder commented 1 year ago

Launchpad user John Petrini(john-d-petrini) wrote on 2016-04-30T22:55:45.746576+00:00

Sorry copy paste mistake. OpenStack is running qemu-kvm version 2.0.0+dfsg-2ubuntu1.22.

ubuntu-server-builder commented 1 year ago

Launchpad user zero(x-rbuntu-z) wrote on 2016-05-03T07:39:00.434885+00:00

Hello,

I tried again with the build 20160502 and have the same issue.

ubuntu-server-builder commented 1 year ago

Launchpad user zero(x-rbuntu-z) wrote on 2016-05-11T10:08:36.503562+00:00

Hello,

Does anyone have an idea of what might be the root cause of this issue ?

I'm happy to help but don't really know where to look/investigate

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2016-05-23T19:34:50.434833+00:00

I suspect the issue is related to cloud-init writing networking configuration data. Could you please shut down the system and then mount it (mount-image-callback will mount easily enough) and copy out /var/log/cloud-init.log ?

The other possibility is related to bug 1577844 .

In both cases there should be timeouts eventually (maybe at the 5 minute mark) that let the boot continue, but likely without networking.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-05-24T02:57:16.184726+00:00

I can confirm there is no timeout, it hangs forever (at least, I left it overnight).

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-05-24T03:05:18.581231+00:00

Shutdown doesn't work either, I needed a hard stop. After mounting the image, there's no cloud-init.log.

ubuntu-server-builder commented 1 year ago

Launchpad user Fryderyk Dziarmagowski(freddix) wrote on 2016-05-31T13:52:46.993689+00:00

Here is a workaround (or better said two) I am using (after converting it to raw) to get it work in Proxmox:

sudo kpartx -a xenial-server-cloudimg-amd64-disk1.raw
sudo mkdir -p /tmp/foo && sudo mount /dev/mapper/loop0p1 /tmp/foo

replace console=ttyS0 in

/tmp/foo/boot/grub/grub.cfg
/tmp/foo/etc/default/grub

with net.ifnames=0

sudo umount /tmp/foo
sudo kpartx -d xenial-server-cloudimg-amd64-disk1.raw
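
The replacement step can be scripted while the image is mounted. This is a minimal sketch, assuming GNU sed and that the kernel arguments contain the literal string console=ttyS0. The function name replace_serial_console is hypothetical and not part of any Ubuntu tooling.

```shell
# Hypothetical helper: swap console=ttyS0 for net.ifnames=0 in every file given.
# Assumes GNU sed (for -i) and the literal option "console=ttyS0" in the file.
replace_serial_console() {
    for cfg in "$@"; do
        sed -i 's/console=ttyS0/net.ifnames=0/g' "$cfg"
    done
}
```

From a root shell, it would be called between the mount and umount steps with the two paths listed above as arguments.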

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2016-05-31T16:18:49.947716+00:00

Julian, Fryderyk, or someone else who's affected,

If you aren't seeing a cloud-init.log on affected instances, could you instead tar up all of /var/log and put it somewhere we can examine?

Thanks,

Dan

ubuntu-server-builder commented 1 year ago

Launchpad user Łukasz Leszczuk(lukasz-leszczuk) wrote on 2016-06-02T17:35:01.189136+00:00

I am experiencing same issue when booting on bare metal server with Ironic.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-06-16T08:46:03.957324+00:00

I am now seeing the same with the 12.04 images currently up:

 20160607/ 07-Jun-2016 06:49 -
 20160610.1/ 11-Jun-2016 05:13 -
 20160610/ 10-Jun-2016 12:13 -

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-06-16T08:55:49+00:00

On Tuesday, 31 May 2016 16:18:49 AEST you wrote:

> Julian, Fryderyk, or someone else who's affected,
>
> If you aren't seeing a cloud-init.log on affected instances, could you instead tar up all of /var/log and put it somewhere we can examine?

The problem is that the disk image doesn't get flushed at any point, so there's nothing in the logs at all - it's the original qcow. And because I have to hard kill the VM, it will never flush.

root@proxmox15:/var/lib/vz/images/204# qemu-nbd --connect=/dev/nbd0 vm- disk-0.qcow2
root@proxmox15:/var/lib/vz/images/204# mount /dev/nbd0p1 /mnt/tmp
root@proxmox15:/var/lib/vz/images/204# ls /mnt/tmp/var/log
apt btmp dist-upgrade fsck landscape lastlog unattended-upgrades upstart wtmp

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-06-16T08:56:15.331280+00:00

15.10 images seem to work, however.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-06-16T09:16:18.661606+00:00

I can confirm the workaround above, removing console=ttyS0 from the kernel parameters stops it from hanging.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-07-04T01:46:10.797607+00:00

Is a permanent resolution imminent on this? The faulty cloud image renders it useless on various platforms.

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2016-07-04T08:18:55.973041+00:00

Hi Julian,

It's still not 100% clear to me what is actually causing the problem, and what workaround fixed it. Can you describe precisely what workaround you used to get a booting image?

Thanks,

Dan

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Edwards(julian-edwards) wrote on 2016-07-05T04:55:32.997604+00:00

Hi - I just loop mounted the image and removed the console=ttyS0 from the kernel args in the grub config, and it boots fine.

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2016-07-05T08:13:18.405158+00:00

Hi Julian,

Does enabling serial consoles in proxmox[0] fix the issue for you?

Dan

[0] https://pve.proxmox.com/wiki/Serial_Terminal

ubuntu-server-builder commented 1 year ago

Launchpad user Vladimir Rutsky(rutsky) wrote on 2016-07-27T23:15:27.577483+00:00

This bug looks similar to https://bugs.launchpad.net/ubuntu/+source/livecd-rootfs/+bug/1546108

ubuntu-server-builder commented 1 year ago

Launchpad user Mark - Syminet(mark-syminet) wrote on 2016-09-09T22:49:55.200451+00:00

Most recent image as of today also hard-locked, ttyS0 fix described above worked.

ubuntu-server-builder commented 1 year ago

Launchpad user MaxZhang(maxzhangx) wrote on 2016-11-30T07:21:59.041607+00:00

Hi,

I think the problem may be that the ttyS0 parameter is incomplete: the speed is not set. Changing it from:

console=ttyS0

to:

console=ttyS0,115200n8

would fix it.
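
One way to apply that change to a GRUB defaults file, as a sketch only. The function name set_serial_speed is made up for illustration, GNU sed is assumed, and update-grub still has to be run afterwards for the change to reach grub.cfg.

```shell
# Hypothetical helper: append ",115200n8" to a bare console=ttyS0 option.
# An option that already carries settings (console=ttyS0,...) is left alone,
# because the pattern requires a space, a quote, or end of line after "ttyS0".
set_serial_speed() {
    sed -i -E 's/console=ttyS0( |"|$)/console=ttyS0,115200n8\1/g' "$1"
}
```

Running it twice on the same file is harmless, since the second pass finds nothing to match.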

ubuntu-server-builder commented 1 year ago

Launchpad user KingJ(kj-kingj) wrote on 2016-12-27T14:11:42.102455+00:00

I can confirm that I am affected by this, running on ESXi 6.5.

I took a slightly different approach to fixing it - adding a virtual serial port to the VM's hardware allowed it to boot successfully.

ubuntu-server-builder commented 1 year ago

Launchpad user Sebastian(sebek-h) wrote on 2017-01-24T19:46:38.713884+00:00

This problem affected my environment with MAAS and the 16.04/16.10/17.04 images. On some servers we use the console on ttyS0, on others ttyS1. Removing console=ttyS1,115200n8 from the Global Kernel Parameters in MAAS resolves the problem (partly). The problem does not occur on 14.04.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-05-30T02:43:36.668551+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

ubuntu-server-builder commented 1 year ago

Launchpad user Mathieu Mitchell(mat128) wrote on 2017-08-21T12:55:45.331898+00:00

Any update on when the updated kernel parameters can make it to official cloud images?

Also worth noting, OpenStack image docs [1] also indicate ttyS0 at 115200n8.

1: https://docs.openstack.org/image-guide/openstack-images.html#ensure-image-writes-boot-log-to-console

ubuntu-server-builder commented 1 year ago

Launchpad user Evan Felix(karcaw) wrote on 2017-12-21T00:25:26.191989+00:00

I am seeing this issue when booting 16.04 images under oVirt. If I add a serial console to the VM it boots fine.

ubuntu-server-builder commented 1 year ago

Launchpad user Evan Felix(karcaw) wrote on 2017-12-21T16:27:22.776823+00:00

I can also confirm that this issue happens in the cloud images for xenial, zesty, artful, and current bionic

ubuntu-server-builder commented 1 year ago

Launchpad user Andrew Paxson(paxsonsa) wrote on 2018-01-10T07:13:56.122695+00:00

I am not sure if this is relevant to your inquiry, but I also found that I had to add an isa-serial device (in virt-manager that's Serial PPTY) to the machine; it then went past that section.

ubuntu-server-builder commented 1 year ago

Launchpad user Keenan Verbrugge(keenanv) wrote on 2018-01-23T19:12:54.645710+00:00

Same issue here. Using ubuntu 16.04

Adding a console for qemu/kvm was able to get me past this:

virsh edit vmname

add:

ubuntu-server-builder commented 1 year ago

Launchpad user ironstorm(ironstorm-gmail) wrote on 2018-04-04T17:31:56.679330+00:00

The same problem exists on VirtualBox using the Apr 2 nightly of bionic cloud image... :(

Workaround on Virtualbox is to add a disconnected serial port to allow booting to continue using the following:

VBoxManage modifyvm "${VM}" --uart1 0x3f8 4 --uartmode1 disconnected

ubuntu-server-builder commented 1 year ago

Launchpad user Edward Vielmetti(edward-vielmetti) wrote on 2018-06-07T13:59:33.408521+00:00

This problem also reported at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/356

If someone who has seen this has done a workaround specifically for Openstack I'd appreciate it.

ubuntu-server-builder commented 1 year ago

Launchpad user Jose Phillips(jose-phillips) wrote on 2018-11-23T10:57:01.768174+00:00

Hi Everyone

Just add a serial port and that will fix the issue. Cloud images try to log the boot to serial port 1.

ubuntu-server-builder commented 1 year ago

Launchpad user Robie Basak(racb) wrote on 2019-01-21T21:47:41.786582+00:00

Here are full steps to reproduce this issue using tooling from Ubuntu only:

uvt-simplestreams-libvirt sync release=bionic arch=amd64 label=release
uvt-kvm create --no-start lp1573095 release=bionic arch=amd64 label=release
virsh edit lp1573095 # delete <serial> and <console> blocks
virsh start lp1573095
uvt-kvm wait lp1573095

Expected behaviour: succeeds when the VM is available
Actual behaviour: hangs and eventually times out

Additionally you can examine the screen with virt-manager. On that screen, I expect a login prompt. Instead I see nothing beyond the normal kernel messages (nothing from userspace).

If you skip the serial/console definition deletion in the steps above, you'll see that the VM works. In other words, the VM stops working if a serial port is not available.

Workaround: remove console=ttyS0 from GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.d/50-cloudimg-settings.cfg, leaving only console=tty1, and then run "sudo update-grub". However, this must either be done on a system with a serial port, or you have to jump through the appropriate hoops to apply the result of "update-grub" without having booted the system. Note that editing /etc/default/grub is insufficient since /etc/default/grub.d/50-cloudimg-settings.cfg overrides it (see bug 1812752).

ubuntu-server-builder commented 1 year ago

Launchpad user Jeremy Busk(busk) wrote on 2019-03-01T00:12:18.955184+00:00

While you can workaround the issue with

sudo sed -i 's/ console=ttyS0//g' /etc/default/grub.d/50-cloudimg-settings.cfg
sudo update-grub

You need ttyS0 in grub in order to interact with vm guest using

virsh console <vm-name>

I added a bug to virtualbox as it could be a compound issue or an issue on how they handle ttyS0 from os. https://www.virtualbox.org/ticket/18463

ubuntu-server-builder commented 1 year ago

Launchpad user David(davidjaquier) wrote on 2019-03-18T16:54:19.845062+00:00

Have the same trouble when I try to deploy cloud images based templates in a cloudstack managed environment on top of esxi 6.5 (GTT VDC).

Is there a way to remove that without deploying a virtual machine? I tried to tar -x the ova, modify the vmdk via guestmount on ubuntu 18 or via fuse for osx, without success.

If someone can tell me an efficient and short way to remove this setting from the .ova, it could be really great.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2019-03-26T12:59:05.807272+00:00

I just added a bunch of other bugs that really are dups of this. The goal of doing so is just to give whoever might be looking at making a change more context on the unfortunate complexity of doing so.

Related bugs:

ubuntu-server-builder commented 1 year ago

Launchpad user Grant Emsley(grantemsley) wrote on 2019-05-02T02:00:05.129331+00:00

I ran into this bug trying to use cloud images on Hyper-V.

The workaround in #40 does work - add a serial console to the VM, and change /etc/default/grub.d/50-cloudimg-settings.cfg

If you still want to be able to use a serial console if available, but not require it to be able to boot, just change the line from 'GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"' to 'GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 console=tty1"'

Then run update-grub.

It seems /dev/console takes on whichever console is listed LAST in the kernel options. If that's ttyS0 and there is no serial port connected, that breaks things. Swapping the order ensures /dev/console goes to tty1, and the boot process works with or without a serial port attached to the VM. If there is a serial port, the serial console will still work with this method.
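
The reordering can be done with a single substitution. This is a sketch, assuming GNU sed and that the file contains the two options exactly as quoted above. The name prefer_tty1_console is hypothetical.

```shell
# Hypothetical helper: list ttyS0 first and tty1 last, so that tty1 is the
# last console= option on the kernel command line.
# Assumes the options appear exactly as "console=tty1 console=ttyS0".
prefer_tty1_console() {
    sed -i 's/console=tty1 console=ttyS0/console=ttyS0 console=tty1/' "$1"
}
```

It would be run as root on /etc/default/grub.d/50-cloudimg-settings.cfg, followed by update-grub.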

ubuntu-server-builder commented 1 year ago

Launchpad user Alejandro Torras(atec-post) wrote on 2020-04-25T12:31:07.612192+00:00

Related bug:

ubuntu-server-builder commented 1 year ago

Launchpad user WGH(wgh) wrote on 2020-05-07T01:23:21.821417+00:00

I debugged this problem a bit. The problem stems from initramfs attempting to use /dev/console (which refers to the nonexistent /dev/ttyS0), having its logging functions unexpectedly return errors, and breaking everything around it.

You may have already noticed that when this happens, 100% CPU time is consumed. If you enable sysrq keys with sysrq_always_enabled=1, and dump the task list (e.g. virsh send-key ubuntu18.04 KEY_LEFTALT KEY_SYSRQ KEY_T), you'll notice that there's always a combination of console_setup/loadkeys/setfont processes with ever-growing PIDs, which likely means that something is running them in a tight loop.

Now, if you patch "panic()" in /usr/share/initramfs-tools/scripts/functions so it would print its argument to the console (echo "panic 1: " "$@" >/dev/kmsg), you'll see that the panic reason is that "filesystem on /dev/vda1 requires manual fsck", and it's printed in a loop. Indeed, the function does contain a loop:

    checkfs() {
        while ! _checkfs_once "$@"; do
            panic "The $2 filesystem on $1 requires a manual fsck"
        done
    }

This is actually a bogus error. The filesystem is (most likely) fine. There's no fsck included in initramfs, so what happens is that the following fragment is executed:

    if ! command -v fsck >/dev/null 2>&1; then
        log_warning_msg "fsck not present, so skipping $NAME file system"
        return
    fi

log_warning_msg, however, returns non-zero status due to stdout being broken, which causes _checkfs_once to return non-zero status as well.

panic doesn't work correctly either: it simply can't spawn a shell on broken /dev/console, and exits immediately, and that's what causes the infinite loop.
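
The failure chain can be reproduced outside the initramfs. The following is a sketch, not the initramfs-tools code. The function names are stand-ins, /dev/full is used as an assumed substitute for the broken /dev/console (any write to it fails), and the loop is capped at three iterations so that it terminates.

```shell
# Stand-in for the initramfs logging helper: it only writes to stdout.
log_warning_msg() {
    echo "$@"
}

# Stand-in for _checkfs_once. The "fsck" lookup is made to fail on purpose.
# A bare `return` hands back the status of the previous command, which is
# the echo inside log_warning_msg, so a failed write becomes a failed check.
checkfs_once_demo() {
    if ! command -v no_such_fsck_binary >/dev/null 2>&1; then
        log_warning_msg "fsck not present, so skipping file system"
        return
    fi
}

# Capped version of the checkfs loop. The real loop has no cap, and panic
# exits immediately on a broken console, hence the endless spin.
# Leaves the number of failed attempts in $tries.
checkfs_demo() {
    tries=0
    while ! checkfs_once_demo; do
        tries=$((tries + 1))
        if [ "$tries" -ge 3 ]; then
            break
        fi
    done
}
```

Calling checkfs_demo with stdout redirected to /dev/full leaves tries at the cap, while calling it with a working stdout leaves tries at 0.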

What I think about the solution.

First, debugging this is a PITA. Adding a serial device might be a perfectly acceptable fix for many, but when this issue happens, absolutely nothing in the console indicates that this is what's missing. Even if it's necessary to leave ttyS0 as the main console, initramfs should at least warn the user (through kmsg) that /dev/console is broken.

Second, errors returned by the logging function causing _checkfs_once to return an error as well is a bug. I think errors in _log_msg should be suppressed. If you do that, unless a panic happens (which is rare), the boot will succeed.
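
A sketch of what suppressing the error could look like. It assumes _log_msg is essentially a printf wrapper, which is not confirmed in this thread, and _log_msg_tolerant is a hypothetical name.

```shell
# Hypothetical tolerant logger: a failed write to stdout is ignored, so
# callers that end with a log call no longer inherit the write error.
# Assumption: the real _log_msg boils down to printf "$@".
_log_msg_tolerant() {
    printf "$@" || true
}
```

With stdout pointed at a device that rejects writes, the function still returns 0, so a check that only logged a warning would no longer be reported as failed.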

Third, as Grant Emsley said, maybe ttyS0 doesn't really have to be the main console?

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-05-07T17:13:46.589752+00:00

The real fix here is kernel improvement (or bug fix if you want to consider current kernel behavior a bug). Anything else is just going to push around the failure.

That is what was determined in 2013, and it's probably still true now. https://bugs.launchpad.net/ubuntu/+source/cloud-initramfs-tools/+bug/1123220/comments/8

ubuntu-server-builder commented 1 year ago

Launchpad user WGH(wgh) wrote on 2020-05-08T20:56:48.542436+00:00

Although I completely agree that the kernel could've automatically chosen the working /dev/console "backend", and that would be the best fix, this won't be fixed soon. Right now users without a serial port have an inexplicable hang that is pretty hard to debug.

Having the initramfs init script report a broken /dev/console would help this situation tremendously, and the fix is very easy: just add

print "$@" || echo "/dev/console appears broken"

to _log_msg, and users will at least know the source of the problem.

ubuntu-server-builder commented 1 year ago

Launchpad user WGH(wgh) wrote on 2020-05-08T20:57:28.418539+00:00

I of course meant

print "$@" || echo "/dev/console appears broken" >/dev/kmsg

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-05-11T14:10:34.682761+00:00

@wgh, My experience is that it is unfortunately not that simple. It may have worked for you.

At the point in which it starts to fail, it repeatedly will fail. But up until some point, writes to stdout work fine. I believe this is because there is a buffer and it only begins failing when it has filled the buffer and tried to flush.

I have a script that I had put into the initramfs in one of the other bugs that shows this. It's quite possible that the behavior has changed in 8 years, but before, you basically just had to write some amount of data to determine if it would fail.
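
A rough sketch of that kind of probe, not the script referred to above: it writes 1 KiB chunks to a target and reports the first failed write. The function name, the chunk size, and the default of 64 chunks are all assumptions.

```shell
# Hypothetical probe: write up to $2 chunks of 1 KiB (default 64) to $1.
# Returns 0 if every write succeeds, 1 at the first write that fails.
probe_writes() {
    target=$1
    chunks=${2:-64}
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # printf pads the empty string to 1024 spaces: one chunk per pass.
        if ! printf '%1024s' '' >>"$target" 2>/dev/null; then
            echo "write to $target failed after $i KiB" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

How much data is needed before a broken console starts rejecting writes is not stated in this thread, so the chunk count would have to be tuned by experiment.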