Open ncfavier opened 1 year ago
Shouldn't udev be well prepared to handle these module load events, as it is the dynamic device management daemon after all? Maybe just tweaking the udev rule a bit to trigger 90-vconsole.rules again after the module load is the right fix. Still, I'm curious about the design behind all this: why we need both a service and a udev rule to get the console right.
> maybe just tweaking the udev rule a bit to trigger 90-vconsole.rules again after the module load is the right fix
That would be great, but I can't figure it out: looking at udev debug logs, the only event emitted for vtcon0 is the initial add. I tried adding the same rule for the amdgpu device and for fb* devices, but in both cases systemd-vconsole-setup fails.
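For anyone who wants to double-check which events actually reach userspace, here is a minimal Python sketch that listens on the kernel uevent netlink socket and keeps only vtconsole events. The function names are made up for illustration, and this is only one way to observe the events; `udevadm monitor --kernel --subsystem-match=vtconsole` should show the same information.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # protocol number from <linux/netlink.h>
KERNEL_GROUP = 1             # multicast group carrying raw kernel uevents


def parse_uevent(raw: bytes) -> dict:
    """Parse one kernel uevent datagram into a dict of its KEY=VALUE fields."""
    parts = raw.split(b"\0")
    fields = {}
    # parts[0] is the "action@devpath" summary; the rest are KEY=VALUE pairs.
    for part in parts[1:]:
        key, sep, value = part.partition(b"=")
        if sep:
            fields[key.decode()] = value.decode(errors="replace")
    return fields


def is_vtconsole_event(fields: dict) -> bool:
    """True if the event belongs to the vtconsole subsystem."""
    return fields.get("SUBSYSTEM") == "vtconsole"


def watch_vtconsole_events() -> None:
    """Print every vtconsole uevent until interrupted (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_GROUP))  # pid 0 lets the kernel pick an address
    while True:
        fields = parse_uevent(sock.recv(65536))
        if is_vtconsole_event(fields):
            print(fields.get("ACTION"), fields.get("DEVPATH"))
```

Calling `watch_vtconsole_events()` in one terminal and then loading the GPU module in another should make it easy to see whether anything other than the initial add shows up.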
> why we need both a service and a udev rule to get the console right
I think the udev rule is for initialising the vconsole as soon as it appears, while the service is for reconfiguring it when changes are made to vconsole.conf.
Searching through the issues led me to this: https://github.com/systemd/systemd/issues/2612
> Well, I don't see how this could ever work: we simply don't know whether there will be another KMS driver showing up or not. Device probing is full async, hence it might appear any time, and there's no point in time where we know everything has shown up. Hence we cannot delay vconsole accordingly.
However it also mentions:
> in systemd 231-232 we reworked the vconsole font setup process, fixing various issues
It does not point out what the reworked process looks like, though.
I saw that; it's rather old. I'm pretty sure the rework is https://github.com/systemd/systemd/pull/3742, which is just the current state of things as we know it.
If that's the rework, it does not seem to address the race conditions, or does it?
At least not the one we care about now.
From what I have gathered, the vconsole setup issue is a won't-fix due to the nature of async device probing. Imagine a scenario where a particular motherboard powers up the GPU halfway through the boot process, after systemd-modules-load.service: no ordering of services would be able to handle this. The mysterious scrolling requires further investigation, though.
If there has to be a fix, I think it's the kernel VT infrastructure's responsibility to either persist the vconsole states across the driver load, or inform udev of the change in order to trigger a reload.
For what it's worth, #210205 should improve the console font situation for hidpi users by relying on in-kernel font selection instead of loading it from userspace. However it doesn't help those who want to customize the font or need a different one for i18n.
I've had a quick look at the fbcon driver, and nothing in there suggests that a resolution change would reset the font to the default. My current working theory is that when loading the gpu driver, a whole new instance of fbcon takes over the existing vt, so all existing state is lost. I don't know how practical it would be for the kernel to persist that on its end, as opposed to notifying userspace of the change. I think it would be best to start by sending a message to the kernel mailing list and see what they have to say about this.
Also, this doesn't just affect systemd in initrd, the same problem exists when loading GPU drivers in stage 2, and I can also reproduce it on Arch.
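One way to probe the takeover theory is to snapshot which backend each vtcon reports before and after the GPU module loads. A small sketch, assuming the `name` and `bind` attributes that the kernel exposes under `/sys/class/vtconsole`; the helper name `list_vtcons` is hypothetical:

```python
from pathlib import Path


def list_vtcons(root="/sys/class/vtconsole"):
    """Return {vtcon: (backend description, is_bound)} as read from sysfs."""
    result = {}
    for entry in sorted(Path(root).glob("vtcon*")):
        # "name" holds a description such as "(M) frame buffer device";
        # "bind" holds "1" if the backend currently drives the console.
        name = (entry / "name").read_text().strip()
        bound = (entry / "bind").read_text().strip() == "1"
        result[entry.name] = (name, bound)
    return result
```

Printing `list_vtcons()` before and after `modprobe amdgpu` shows whether a new vtcon appears or an existing one is reused. It does not show the font itself, so it only covers half of the theory.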
> sending a message to the kernel mailing list
Do you want to take care of it? You seem more competent than I am.
I would, but I'm very overwhelmed with other things right now so I'm not really sure that I can give this proper attention.
I found that systemd-vconsole-setup is called only by the udev rules, and that the initrd does not include systemd-vconsole-setup.service. The solution would be to include this service in the initrd and add systemd-modules-load.service as a dependency.
That's incorrect as far as I know. The udev rule exists for event-driven initialization and it is working as intended. This is a kernel problem and it would be much more useful to fix the root cause rather than come up with bandaid workarounds.
I actually ended up posting to lkml about this because apparently I'm hyperfixating on this anyway.
https://lore.kernel.org/all/CANnEQ3Ef5-XRSVL=RCBuKKhR0oZF+SO2BSSiBigZOyjMeQ7f_g@mail.gmail.com/
Unfortunately I never got a reply. :( Maybe I should try again.
Interesting find, I wonder if that commit helps or makes things worse.
Edit: looks like no difference. That revert is present in 6.2.7, but I'm still seeing the same problem. So that change really just removed code that didn't work.
Here's the same patch in mainline: https://github.com/torvalds/linux/commit/12d5796d55f9fd9e4b621003127c99e176665064.
It would be helpful to have a minimal VM repro of this (using virtio-gpu maybe?).
Also here's my understanding of what's happening and what needs to change: during boot, we initially see efifb and the first fbcon instance. Then when the GPU driver loads, a new instance of fbcon spawns on the new framebuffer driver, and takes over the existing vtcon. This discards the font, and because a new vtcon isn't being added, no event hits udev.
One way to work around this problem is to pass the quiet option to the kernel. In that case, the vtcon isn't initialized until fairly late in the boot process, often long after the GPU driver has loaded, so the font doesn't get disrupted.
So what needs to happen is the kernel should emit a "change" event whenever a new backend driver takes over a vtcon.
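To make the sequence concrete, here is a toy Python model of my understanding. It is not how the kernel or udev are structured: `ToyConsole` and its methods are invented for illustration, the font values are placeholders, and the "change" event is the hypothetical one proposed above.

```python
DEFAULT_FONT = "kernel-default"


class ToyConsole:
    """Toy model of one vtcon and the udev rule that configures it."""

    def __init__(self, emit_change_on_takeover):
        self.emit_change_on_takeover = emit_change_on_takeover
        self.backend = None
        self.font = DEFAULT_FONT
        self.events = []  # uevents that reached "udev"

    def _emit(self, action):
        self.events.append(action)
        # Stand-in for the udev rule running systemd-vconsole-setup.
        # Assumption: reacting to "change" as well as "add" is the
        # hypothetical extension, not current behaviour.
        if action in ("add", "change"):
            self.font = "configured"

    def register(self, backend):
        """First backend appears: the vtcon device is added."""
        self.backend = backend
        self.font = DEFAULT_FONT
        self._emit("add")

    def takeover(self, backend):
        """A new backend replaces the old one on the same vtcon."""
        self.backend = backend
        self.font = DEFAULT_FONT  # state of the previous instance is lost
        if self.emit_change_on_takeover:
            self._emit("change")


for emit_change in (False, True):
    console = ToyConsole(emit_change_on_takeover=emit_change)
    console.register("efifb")
    console.takeover("amdgpu")
    print(emit_change, console.events, console.font)
```

Without the change event the model ends up with only an "add" in the event list and the default font; with it, the rule runs a second time and the configured font is restored.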
The systemd-based initrd parallelises the early boot process, which is good except that it removes the ordering between loading kernel modules (systemd-modules-load.service) and udev (systemd-udevd.service). This causes issues for me because my graphics driver (amdgpu) takes about 3 seconds to load, at which point:
[...] 90-vconsole.rules udev rule is reset to the default, and only reset correctly much later, when reload-systemd-vconsole-setup.service is pulled by multi-user.target;
Setting [...] fixes both issues.
If this turns out to be the correct solution to this problem then it should be taken upstream, but I wanted to ask opinions here first.
There is also the alternative of considering issues 1 and 2 as bugs (with Linux? amdgpu?) and trying to fix them, which might take a long time.
cc @oxalica @ElvishJerricco @NickCao @lheckemann @dasJ