canonical / pc-gadget

The gadget snap for Personal Computers using 64bit Intel or AMD processors

console=ttyS0 is too slow and useless #48

Open xnox opened 4 years ago

xnox commented 4 years ago

console=ttyS0 is specified in the gadget by default, in UC20, for all modes: recovery, install, and run mode.

However, on hardware that does not have a serial console (the majority of real x86 hardware), this option significantly delays the boot: the kernel polls for the serial console to appear, delaying the boot by 90s.

Furthermore, if the serial console is present, the baud rate is not set high enough, so boots are still painfully slow.
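
To put the baud-rate point in numbers, here is a rough back-of-the-envelope sketch. It assumes 8N1 framing (10 bits on the wire per byte) and a hypothetical 100 kB of boot messages; the thread gives no actual figure for the log volume, and the two baud rates are just common values picked for comparison.

```python
def serial_bytes_per_second(baud: int, bits_per_char: int = 10) -> float:
    """Payload bytes per second, assuming 8N1 framing
    (1 start bit + 8 data bits + 1 stop bit = 10 bits per byte)."""
    return baud / bits_per_char


def seconds_to_print(num_bytes: int, baud: int) -> float:
    """Time needed to push num_bytes of console output through the serial line."""
    return num_bytes / serial_bytes_per_second(baud)


# Hypothetical amount of kernel/journald output during boot (an assumption).
BOOT_LOG_BYTES = 100_000

for baud in (9600, 115200):
    print(f"{baud:>6} baud: {seconds_to_print(BOOT_LOG_BYTES, baud):.1f} s")
```

Since console writes are synchronous, whatever time the serial line needs is added more or less directly to the boot time, which is why even a fast serial console is not free.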

I would like to drop serial console option from the pc gadget. If not completely, I can see the value of keeping it for the recover mode. Alternatively I think we should publish a separate serial pc gadget, that specifies only the serial console with a high baud rate.

Could we make console a grubenv parameter, such that ubuntu-image / snap-prepare-image can modify it, and it would persist from install mode, to sealed secrets, run/recover modes?

Also see https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1879290

xnox commented 4 years ago

Or for example only enable and run console-conf on it, and not make kernel/journald slowly push messages to serial console delaying the whole boot.

anonymouse64 commented 4 years ago

Note that changes to the kernel command line should probably not be made until snapd has support for reading the kernel command line from the gadget.yaml or otherwise from the gadget. Right now the kernel command line that we seal the TPM to is hard-coded in the snapd snap, so changing it here will break FDE.

I believe @bboozzoo was working on the feature to read the kernel command-line from the gadget, do you have a status update on that?
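
The FDE concern above can be illustrated with a deliberately simplified model of TPM measurement. This is not snapd's sealing code: the single-register model, the helper names, and the example command lines are all assumptions made for illustration. The only real mechanism shown is the extend operation, where the new register value is a hash of the old value concatenated with the digest of the measured data.

```python
import hashlib


def extend(pcr: bytes, data: bytes) -> bytes:
    """TPM-style extend: new = SHA-256(old || SHA-256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def measure_cmdline(cmdline: str) -> bytes:
    """Simplified stand-in for measuring a kernel command line into one register
    that starts at all zeroes (hypothetical helper, not a snapd API)."""
    return extend(bytes(32), cmdline.encode())


# Value the key was sealed against (command line hard-coded at seal time).
sealed_against = measure_cmdline("console=ttyS0 console=tty1 panic=-1")

# Value produced on a boot where the gadget dropped the serial console.
booted_with = measure_cmdline("console=tty1 panic=-1")

# Unsealing only works if the measurements match exactly.
print("unseal would succeed:", sealed_against == booted_with)
```

Any change to the command line, even removing one argument, yields a different measurement, so the gadget and the sealing code have to agree on the exact string.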

xnox commented 4 years ago

@anonymouse64 i am aware of the current duplication / disconnect of the gadget vs sealing code, so yeah will not push out an update to this uncoordinated.

ogra1 commented 4 years ago

all x86 IoT gateways i have touched yet (as well as most servers) do default to using a serial console ... dropping it completely doesn't smell like a good plan ...

xnox commented 4 years ago

> as well as most servers

The current reference target for the pc gadget is Intel NUC, which does not have serial console by default.

Ubuntu Core does not target servers.

Can you please elaborate on the "IoT gateways" => can they run the stock PC gadget, or have custom ones? I thought they all have custom gadgets and do not use the reference PC gadget.

xnox commented 4 years ago

Also clouds may or may not have serial console, but they should be forking their own gadget anyway.

xnox commented 4 years ago

@anonymouse64 @bboozzoo if it helps, we can turn the console=* values into a variable in either the stock grubenv file or a custom grubenv file, e.g. an install-settings.conf grubenv file that holds overrides for the console= values to use/seal, and likewise for the cloud-init datasources to use/seal.

anonymouse64 commented 4 years ago

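My 2¢ here is that we should probably:

  • always leave the serial console on for run mode and recover modes in the default gadget
  • always have the serial console go as fast as possible when enabled
  • make the kernel cmdline for using the serial console configurable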

I agree with @ogra1, it might be the case that the Intel NUC doesn't have a serial, but for example another IoT amd64 edge gateway we enabled UC16 for was the Dell gateways which do have physical serial ports.

ogra1 commented 4 years ago

> Can you please elaborate on the "IoT gateways" => can they run the stock PC gadget, or have custom ones? I thought they all have custom gadgets and do not use the reference PC gadget.

well, the dell gateways (which admittedly currently come with a custom gadget) would be an example, all advantech ones i have touched yet. but also 90% of other Industrial PCs that might "just install" x86 focused images we provide on cdimage.

after all the typical IoT or industrial PC is often headless, yet an x86 base often means you can use an uncustomized image on them, unlike with arm devices where you can not have a generic image easily due to HW specific bootloaders.

EDIT: i mentioned servers simply because IoT GWs are typically a cut down server, not a cut down desktop ...

anonymouse64 commented 4 years ago

> x86 base often means you can use an uncustomized image on them

> servers simply because IoT GWs are typically a cut down server

This is precisely why I think we should leave serial on by default in the pc gadget so that folks can "test-drive" UC on their IoT devices by just flashing a released default image and login with console-conf via serial without needing to build their own gadget snap/image.

xnox commented 4 years ago

> My 2¢ here is that we should probably:
>
>   • always leave the serial console on for run mode and recover modes in the default gadget
>   • always have the serial console go as fast as possible when enabled
>   • make the kernel cmdline for using the serial console configurable

That will cost us a lot of boot time out of the box. Even "as fast as possible" is very slow. 30s+ of additional boot time.

Note, this is about dropping "console=" from the kernel command line to stop forcing the kernel to slow down its boot to the speed at which it can push kmsg to the serial console.

This is not about stopping/preventing consoleconf to run on serial consoles. By default it is spawned on them all.

xnox commented 4 years ago

> EDIT: i mentioned servers simply because IoT GWs are typically a cut down server, not a cut down desktop ...

It's an embedded platform, neither desktop nor server. For something to be called a server, I expect 1TB of RAM, 1PB of NVMe storage, RAID, InfiniBand, etc.

ogra1 commented 4 years ago

while console-conf will indeed still come up, are there not menu bits at the initrd level now that would also use the defined console= ?

indeed, if it is just kernel boot messages we lose, that's completely negligible and i'd fully agree with the removal, but AFAIK there are potentially interactive bits before systemd kicks in as well

anonymouse64 commented 4 years ago

> This is not about stopping/preventing consoleconf to run on serial consoles. By default it is spawned on them all.

So w/o console=ttyS0 in the kernel command line for run mode, what would the user experience be like? They plug in their device, look at a blank serial console for ... however many minutes, and then magically at some point console-conf shows up?

xnox commented 4 years ago

> This is not about stopping/preventing consoleconf to run on serial consoles. By default it is spawned on them all.

> So w/o console=ttyS0 in the kernel command line for run mode, what would the user experience be like? They plug in their device, look at a blank serial console for ... however many minutes, and then magically at some point console-conf shows up?

Good question. Need to double check experimentally, I can record some videos.

Somehow it still feels wrong to have both enabled by default on any hardware. It almost feels more appropriate to detect console in grub, and if it is serial pass serial console to the kernel, if it's video pass video to the kernel.

xnox commented 4 years ago

> This is not about stopping/preventing consoleconf to run on serial consoles. By default it is spawned on them all.

> So w/o console=ttyS0 in the kernel command line for run mode, what would the user experience be like? They plug in their device, look at a blank serial console for ... however many minutes, and then magically at some point console-conf shows up?

We know that today, the experience is of 30s+ hang with no output from the kernel, when waiting for serial to show up that does not exist. Because we force the kernel to look for one, when there isn't one.

anonymouse64 commented 4 years ago

> We know that today, the experience is of 30s+ hang with no output from the kernel

Arguably this is a regression from UC18 -> UC20 in that there is a 30s+ hang with no output from the kernel on non-serial TTYs because the kernel is stuck trying to write to a non-existent serial TTY.

I'd hate to introduce what appears to be a hang on serial TTYs just because we don't want what appears to be a hang on non-serial TTYs.

> It almost feels more appropriate to detect console in grub, and if it is serial pass serial console to the kernel, if it's video pass video to the kernel.

This would be great but I don't know how we can do that while still enabling automatic FDE by sealing the kernel command-line against the TPM, unless both snapd + grub somehow learn to check if there are serial TTY's on the system, etc. Maybe there's a simpler solution I'm not aware of.

xnox commented 4 years ago

> We know that today, the experience is of 30s+ hang with no output from the kernel

> Arguably this is a regression from UC18 -> UC20 in that there is a 30s+ hang with no output from the kernel on non-serial TTYs because the kernel is stuck trying to write to a non-existent serial TTY.

> I'd hate to introduce what appears to be a hang on serial TTYs just because we don't want what appears to be a hang on non-serial TTYs.

Not a regression, UC18 also hangs in the same way.

> It almost feels more appropriate to detect console in grub, and if it is serial pass serial console to the kernel, if it's video pass video to the kernel.

> This would be great but I don't know how we can do that while still enabling automatic FDE by sealing the kernel command-line against the TPM, unless both snapd + grub somehow learn to check if there are serial TTY's on the system, etc. Maybe there's a simpler solution I'm not aware of.

As per original London sprint design, snapd must seal against the install-time dynamic cmdline and persist that through modes/kernel updates and resealings. That was the requirement of the original design. Currently snapd doesn't do resealing as far as I can tell, but it must support that.

jocado commented 3 years ago

Is there any movement or update on this ?

I'm particularly interested, as this causes an artificially long boot time on NUCs, and that eats into our Service Level budget on updates that require a reboot.

Also, now that snapd seems to be in control of the grub config, what is the recommended way to change the linux commandline ? Is it even possible from the gadget ?

Thanks.

anonymouse64 commented 3 years ago

> As per original London sprint design, snapd must seal against the install-time dynamic cmdline and persist that through modes/kernel updates and resealings

I can't speak to the original London sprint design as I wasn't there and joined the project later, but the new plan is to have snapd dynamically generate the kernel command line that is to be used with sealing using the following things:

The last bit is what we are currently missing from snapd, which is a way for a gadget snap to specify additional kernel command line parameters. We have a rough plan and will implement it soon.

> Also, now that snapd seems to be in control of the grub config, what is the recommended way to change the linux commandline ? Is it even possible from the gadget ?

Currently there is not a way to configure the command line without recompiling snapd. As mentioned, we will be working on a way to do this soon.

jocado commented 3 years ago

Thanks for the info.

Sounds like, if the only static config is panic=-1, serial console by default is being removed ?

anonymouse64 commented 3 years ago

Ah yes, sorry, I forgot to explain that too. Currently both the panic=-1 and console=... settings are considered part of the static snapd config. Once we have the mechanism for gadgets to set additional kernel command line parameters, we will move the console=... setting from inside snapd to the gadget. The default gadget snap published by Canonical will then likely still support the serial console, but a fork of the Canonical gadget snap could easily remove that if desired.

jocado commented 3 years ago

Perfect - sounds good.

xnox commented 3 years ago

I have outstanding tasks to experiment with master serial console options, and/or speeding up the kernel's serial console.

jocado commented 3 years ago

Hi.

Just wondering, seeing as snapd 2.48 was supposed to be the target release for UC20, and it's at candidate stage, if there was any way I can test changing grub config via snapd 2.48 yet ?

anonymouse64 commented 3 years ago

@jocado the feature enabling gadget-specified kernel command line options will not be in 2.48. It is still under very active development but is getting much closer; for example, see https://github.com/snapcore/snapd/pull/9724 and https://github.com/snapcore/snapd/pull/9719, which bring us closer to the final bits needed for this. It is unclear whether we will backport those changes to 2.48 to be available in e.g. 2.48.1, or whether the feature will just go into 2.49.

jocado commented 3 years ago

Hi @anonymouse64

Just checking in here to see whether we can yet, or have a good idea of when we will be able to, disable the serial console args in the kernel command line via the gadget.

Is it supported in snapd 2.49 which is currently in the beta channel ?

Thanks!

anonymouse64 commented 3 years ago

@jocado unfortunately no, 2.49 does not have the full set of changes yet, we will keep you updated on when the feature is enabled. Thanks for your patience.

jocado commented 3 years ago

It looks like we are very close now :)

https://forum.snapcraft.io/t/customising-uc20-kernel-command-line-arguments/24370

You will see a comment from me there. I have tested and it's working for me with the current edge revision.

@anonymouse64 Is there any rough release date for snapd-2.50 ?

anonymouse64 commented 3 years ago

@jocado snapd 2.50 is being released to stable as we speak. It is released in phases, so not every device will get it at the same time, but by the current looks of it I think it should be phased in to 100% of devices within the next 24 hours.

jocado commented 3 years ago

@anonymouse64 which revision will it be though ? As the feature only seemed to be working for me in the current edge channel.

2.50+git1692.g1286560 2021-05-12 (11995)

Candidate 11841 was not working as expected.

anonymouse64 commented 3 years ago

Revision 11841 is being released, can you detail in the forum post how the candidate channel didn't work for you?

jocado commented 3 years ago

It was simply that the cmdline.full was not respected, the default was still in use.

I did try and look around in logs etc, but I didn't see anything useful or obvious clues. I can add the contents of the cmdline.full , and any other details I can think of. I will do that tomorrow.
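
One quick way to confirm which command line is actually in effect on a booted device is to inspect /proc/cmdline. The sketch below is only an illustration; the helper names are made up here, and the sample string in the usage note is not taken from a real device.

```python
from pathlib import Path
from typing import List


def console_args(cmdline: str) -> List[str]:
    """Return the value of every console= argument, in order of appearance."""
    return [tok.split("=", 1)[1] for tok in cmdline.split()
            if tok.startswith("console=")]


def has_arg(cmdline: str, arg: str) -> bool:
    """True if arg appears as a whole token on the command line."""
    return arg in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    if Path("/proc/cmdline").exists():
        running = read_cmdline()
        print("consoles in use:", console_args(running))
        print("serial console still present:",
              any(c.startswith("ttyS") for c in console_args(running)))
```

For example, `console_args("console=ttyS0 console=tty1 panic=-1")` returns both console values, which would indicate the default was still in use rather than a gadget override without the serial console.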

bboozzoo commented 3 years ago

It'd be interesting to see debug level logs. You can add snapd.debug=1 to the command line to enable debug logging in snapd.

Perhaps it's also useful to take a look at the spread test we have: https://github.com/snapcore/snapd/blob/master/tests/nested/manual/core20-custom-kernel-commandline/task.yaml the test repacks pc gadget and goes through cmdline.extra/cmdline.full variants.
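
As a rough mental model of the two variants, here is a sketch of how the static part of the command line could be composed. It assumes that cmdline.extra is appended to the built-in defaults, that cmdline.full replaces them, and that supplying both is an error; those semantics, the default string, and the function name are assumptions for illustration, not snapd's implementation. The spread test linked above is the authoritative reference.

```python
from typing import Optional

# Assumed built-in static arguments; the thread only names panic=-1 and console=...
DEFAULT_STATIC = "console=ttyS0 console=tty1 panic=-1"


def effective_static_cmdline(extra: Optional[str] = None,
                             full: Optional[str] = None,
                             default: str = DEFAULT_STATIC) -> str:
    """Compose the static kernel arguments from hypothetical gadget file contents.

    extra: contents of cmdline.extra (assumed to be appended to the defaults)
    full:  contents of cmdline.full (assumed to replace the defaults)
    """
    if extra is not None and full is not None:
        raise ValueError("provide only one of cmdline.extra or cmdline.full")
    if full is not None:
        return " ".join(full.split())
    if extra is not None:
        return " ".join(default.split() + extra.split())
    return default


# A gadget fork that drops the serial console entirely:
print(effective_static_cmdline(full="console=tty1 panic=-1"))
# A gadget that only adds an argument on top of the defaults:
print(effective_static_cmdline(extra="snapd.debug=1"))
```

Under this model, dropping console=ttyS0 needs cmdline.full, since cmdline.extra can only add arguments.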

bboozzoo commented 3 years ago

> It was simply that the cmdline.full was not respected, the default was still in use.

> I did try and look around in logs etc, but I didn't see anything useful or obvious clues. I can add the contents of the cmdline.full , and any other details I can think of. I will do that tomorrow.

BTW. have you installed the device from scratch maybe? snapd 2.50 carries an update to the boot script which supports cmdline.full, however, we decided to not bump the boot config version number, thus your current boot script will not get automatically updated.

jocado commented 3 years ago

I did install it from scratch yes. That is one of our common use cases currently.

What should I expect in that situation though ? It doesn't work from system bootstrap, but then works at some point in the future, next time the gadget is updated perhaps ?

bboozzoo commented 3 years ago

> I did install it from scratch yes. That is one of our common use cases currently.

> What should I expect in that situation though ? It doesn't work from system bootstrap, but then works at some point in the future, next time the gadget is updated perhaps ?

I'm looking into it right now. Looks like there's some mixup with what was cherry picked for 2.50. Some bits made it, but ones that glue everything together did not. I need to double check with @mvo5 but we may need to do 2.50.1.

In the meantime, can you try edge branch?

jocado commented 3 years ago

It worked 100% for me with the edge revision yesterday.

bboozzoo commented 3 years ago

> It worked 100% for me with the edge revision yesterday.

That's good. When I have a branch for 2.50 ready, I'll add a link to it here. We build artifacts with the snapd snap as part of the workflow, you'll be able to grab it from there and verify.

jocado commented 3 years ago

Great - thank you :+1:

bboozzoo commented 3 years ago

The branch is up https://github.com/snapcore/snapd/pull/10265 AFAIK we haven't decided yet whether this will be in 2.50.

jocado commented 3 years ago

ok - thanks :crossed_fingers: - we are very keen for this feature :slightly_smiling_face:

bboozzoo commented 3 years ago

The tests have finished, and relevant ones were successful. When you click on the test workflow details, you should be able to access artifacts, which is a zip file with the snapd snap from that branch inside.

jocado commented 3 years ago

Hi. Sorry for the delayed response. Just to confirm, the artifact above seemed to work for me.

jocado commented 3 years ago

Just following on from last week, and the current revision that made it to current/stable, I presume we are looking at 2.50.1 now.

Not looking for absolutes, but is there any kind of rough ETA for that ? Are we talking weeks or months ?

jocado commented 3 years ago

Hi.

Can anyone confirm if we are looking at 2.50.1 for the above change, or is it a 2.51 change now [ which looks to be incoming anyway ].

anonymouse64 commented 3 years ago

2.50.1 has the fix and should be in stable now, but yes 2.51 also has the full fix and should be headed to stable next week hopefully.