NixOS / nixpkgs

Nix Packages collection & NixOS

systemd initrd loads graphics driver too late #202846

Open ncfavier opened 1 year ago

ncfavier commented 1 year ago

The systemd-based initrd parallelises the early boot process, which is good except that it removes the ordering between loading kernel modules (systemd-modules-load.service) and udev (systemd-udevd.service). This causes issues for me because my graphics driver (amdgpu) takes about 3 seconds to load, at which point:

  1. the console font set by the 90-vconsole.rules udev rule is reset to the default, and is only set correctly again much later, when reload-systemd-vconsole-setup.service is pulled in by multi-user.target;
  2. as reported here, the console gets scrolled up a few lines while the LUKS password prompt stays floating in the middle of the screen. I don't know whether this is a bug in amdgpu, Linux, or systemd, or how to even begin troubleshooting it.

Setting

{ boot.initrd.systemd.services.systemd-udevd.after = [ "systemd-modules-load.service" ]; }

fixes both issues.

If this turns out to be the correct solution to this problem then it should be taken upstream, but I wanted to ask opinions here first.

There is also the alternative of considering issues 1 and 2 as bugs (with Linux? amdgpu?) and trying to fix them, which might take a long time.

cc @oxalica @ElvishJerricco @NickCao @lheckemann @dasJ

NickCao commented 1 year ago

Shouldn't udev be well prepared to handle these module load events? It is the dynamic device management daemon, after all. Maybe just tweaking the udev rule a bit to trigger 90-vconsole.rules again after the module load is the right fix. Still, I'm curious about the design behind all this: why we need both a service and a udev rule to get the console right.

ncfavier commented 1 year ago

maybe just tweaking the udev rule a bit to trigger 90-vconsole.rules again after the module load is the right fix

That would be great, but I can't figure it out: looking at udev debug logs, the only event emitted for vtcon0 is the initial add. I tried adding the same rule for the amdgpu device and for fb* devices, but in both cases systemd-vconsole-setup fails.

why we need both a service and a udev rule to get the console right

I think the udev rule is for initialising the vconsole as soon as it appears, while the service is for reconfiguring it when changes are made to vconsole.conf.

NickCao commented 1 year ago

Searching through the issues led me to this: https://github.com/systemd/systemd/issues/2612

Well, I don't see how this could ever work: we simply don't know whether there will be another KMS driver showing up or not. Device probing is full async, hence it might appear any time, and there's no point in time where we know everything has shown up. Hence we cannot delay vconsole accordingly.

However it also mentions:

in systemd 231-232 we reworked the vconsole font setup process, fixing various issues

It does not say what the reworked process looks like, though.

ncfavier commented 1 year ago

I saw that, but it's rather old. I'm pretty sure the rework is https://github.com/systemd/systemd/pull/3742, which is just the current state of things as we know it.

NickCao commented 1 year ago

If that's the rework, it does not seem to address the race conditions, or does it?

ncfavier commented 1 year ago

At least not the one we care about now.

NickCao commented 1 year ago

From the information I have, the vconsole setup issue is a won't-fix because of the nature of async device probing. Imagine a scenario where a particular motherboard powers up the GPU halfway through the boot process, after systemd-modules-load.service: no ordering of services would be able to handle this. The mysterious scrolling requires further investigation, though.

NickCao commented 1 year ago

If there has to be a fix, I think it's the kernel VT infrastructure's responsibility to either persist the vconsole states across the driver load, or inform udev of the change in order to trigger a reload.

9ary commented 1 year ago

For what it's worth, #210205 should improve the console font situation for hidpi users by relying on in-kernel font selection instead of loading it from userspace. However it doesn't help those who want to customize the font or need a different one for i18n.

I've had a quick look at the fbcon driver, and nothing in there suggests that a resolution change would reset the font to the default. My current working theory is that when loading the gpu driver, a whole new instance of fbcon takes over the existing vt, so all existing state is lost. I don't know how practical it would be for the kernel to persist that on its end, as opposed to notifying userspace of the change. I think it would be best to start by sending a message to the kernel mailing list and see what they have to say about this.

Also, this doesn't just affect systemd in initrd: the same problem exists when loading GPU drivers in stage 2, and I can also reproduce it on Arch.

ncfavier commented 1 year ago

sending a message to the kernel mailing list

Do you want to take care of it? You seem more competent than I am.

9ary commented 1 year ago

I would, but I'm very overwhelmed with other things right now so I'm not really sure that I can give this proper attention.

rikkaneko commented 1 year ago

I found that systemd-vconsole-setup is invoked only by the udev rule, and that the initrd does not include systemd-vconsole-setup.service. The solution would be to include this service in the initrd and add systemd-modules-load.service as a dependency.

9ary commented 1 year ago

That's incorrect as far as I know. The udev rule exists for event-driven initialization and it is working as intended. This is a kernel problem and it would be much more useful to fix the root cause rather than come up with bandaid workarounds.

9ary commented 1 year ago

I actually ended up posting to lkml about this because apparently I'm hyperfixating on this anyway.

https://lore.kernel.org/all/CANnEQ3Ef5-XRSVL=RCBuKKhR0oZF+SO2BSSiBigZOyjMeQ7f_g@mail.gmail.com/

9ary commented 1 year ago

Unfortunately I never got a reply. :( Maybe I should try again.

ncfavier commented 1 year ago

Maybe relevant?: https://lore.kernel.org/all/20230227020855.1051605-8-sashal@kernel.org/

9ary commented 1 year ago

Interesting find, I wonder if that commit helps or makes things worse.

Edit: looks like no difference. That revert is present in 6.2.7, but I'm still seeing the same problem. So that change really just removed code that didn't work.

Here's the same patch in mainline: https://github.com/torvalds/linux/commit/12d5796d55f9fd9e4b621003127c99e176665064.

9ary commented 1 year ago

It would be helpful to have a minimal VM repro of this (using virtio-gpu maybe?).

Also here's my understanding of what's happening and what needs to change: during boot, we initially see efifb and the first fbcon instance. Then when the GPU driver loads, a new instance of fbcon spawns on the new framebuffer driver, and takes over the existing vtcon. This discards the font, and because a new vtcon isn't being added, no event hits udev.

One way to work around this problem is to pass the quiet option to the kernel. In that case, the vtcon isn't initialized until fairly late in the boot process, often long after the GPU driver has loaded, so the font doesn't get disrupted.
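
On NixOS, the quiet workaround could be written roughly as below. This is a minimal sketch, assuming the standard boot.kernelParams option. It only sidesteps the font reset by delaying vtcon initialization, and it also suppresses most kernel messages on the console.

```nix
{
  # Workaround sketch, not a fix: "quiet" delays vtcon initialization,
  # which usually lets the GPU driver finish loading first.
  boot.kernelParams = [ "quiet" ];
}
```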

So what needs to happen is the kernel should emit a "change" event whenever a new backend driver takes over a vtcon.
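
For illustration only, here is what userspace could do if the kernel did emit such an event. This is a hypothetical sketch: current kernels do not send a "change" event for vtcon devices, so this rule would never fire today. The match keys are assumed to mirror the existing rule for the initial add, and the helper path under config.systemd.package is also an assumption.

```nix
{ config, ... }:
{
  # Hypothetical: depends on a "change" uevent for vtcon devices that the
  # kernel does not emit at the time of this discussion.
  services.udev.extraRules = ''
    ACTION=="change", SUBSYSTEM=="vtconsole", KERNEL=="vtcon*", RUN+="${config.systemd.package}/lib/systemd/systemd-vconsole-setup"
  '';
}
```

With a rule like this, the font would be reapplied as soon as the new framebuffer driver took over the vtcon, with no need to order services around module loading.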