NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

black/glitchy screen upon boot since https://github.com/NixOS/nixpkgs/pull/279789 with amd+nvidia GPUs in laptop #330379

Closed thibautbenjamin closed 1 month ago

thibautbenjamin commented 1 month ago

Describe the bug

After a system update, my laptop does not boot anymore. At some point during the boot process, the screen goes black and I get visual glitches, such as horizontal green lines. From there, there does not seem to be anything I can do (dropping to a tty does not work), so I could only do a hard shutdown.

Steps To Reproduce

Update system to the most recent state of nixos-unstable.

Expected behavior

System boots normally

Additional context

After a bit of testing and bisecting, I could pin down the exact commit where the issue starts appearing: 7c3815ab71cf6819c0aefb7eba4b4f2be2e85997. I have currently pinned the input to the previous commit, where I do not encounter the issue.
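For readers who have not pinned an input before, here is a minimal sketch of what this could look like, assuming a flake-based configuration (the issue does not say how the pin was actually done). The revision string, the `laptop` host name, and `./configuration.nix` are placeholders, not values from this report.

```nix
{
  # Pin nixpkgs to one exact commit instead of following nixos-unstable.
  # Placeholder: substitute the full hash of the commit just before
  # 7c3815ab71cf6819c0aefb7eba4b4f2be2e85997.
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/REPLACE_WITH_PARENT_COMMIT";

  outputs = { self, nixpkgs }: {
    # "laptop" is a hypothetical host name.
    nixosConfigurations.laptop = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [ ./configuration.nix ];
    };
  };
}
```

With a pin like this, the system keeps building from the known-good commit until the URL is changed back to a branch.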

I am running this on a Lenovo Legion laptop with dual AMD/Nvidia GPUs. It also uses the MediaTek mt7921e for the wifi card. I have little knowledge of kernel/low-level stuff, but I am happy to help, give more info, and do more testing with some guidance.

Metadata

This is the metadata for the system pinned to the previous commit, on which I can boot:

nix-shell -p nix-info --run "nix-info -m"

 - system: `"x86_64-linux"`
 - host os: `Linux 6.10.0, NixOS, 24.11 (Vicuna), 24.11.20240718.4ede20c`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.5`
 - nixpkgs: `/nix/store/nskn5gl69hpp5cwanr4hkli8jzplqaxh-source`


Ai-Elias commented 1 month ago

Could this be related to this discussion? I am getting a similar experience when updating my Nvidia system to 24.05 and when booting the official Gnome and KDE live images.

https://discourse.nixos.org/t/nixos-24-05-graphical-installation-iso-crashes-after-selecting-a-boot-option/47614/16

Atemu commented 1 month ago

This does not sound like a boot failure but rather a GPU failure. Try sshing into your machine or navigating it blindly to verify whether it booted successfully otherwise. Also observe the journal from boot attempts made with the broken config and post the relevant sections.

cc @nazarewk @K900

msfjarvis commented 1 month ago

I'm also having this issue on a PC with an AMD CPU (w/ integrated graphics) and an NVIDIA RTX 4070. I'll try to get a live USB going so I can chroot and grab the logs.

nazarewk commented 1 month ago

@thibautbenjamin @msfjarvis

This sounds exactly like the issue which started popping up in late March 2024 and got resolved in nixos-unstable by 7c3815a (#279789)

Can you somehow try integrating the commit into your config (building the kernel will take some time) and confirm whether it works?

nazarewk commented 1 month ago

Sorry for the confusion: the above might be the reason for the 24.05 issues, since 24.05 does not have #279789 merged in, but this issue mentions 24.11, so it is most likely on unstable.

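I tried to gather some discussion giving a clue on what might be happening and why, but I'm not the one who found the "fix":

* [nixos/hardware.display: init module #279789 (comment)](https://github.com/NixOS/nixpkgs/pull/279789#issuecomment-2106205288)

* [nixos/hardware.display: init module #279789 (comment)](https://github.com/NixOS/nixpkgs/pull/279789#issuecomment-2148560802)

* https://discourse.nixos.org/t/copying-custom-edid/31593/32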

CC @jys1670 , who came up with the FW_LOADER fix

msfjarvis commented 1 month ago

> This sounds exactly like the issue which started popping up in late March 2024 and got resolved in nixos-unstable by 7c3815a (#279789)
>
> Can you somehow try integrating the commit into your config (building the kernel will take some time) and confirm whether it works?

I'm having trouble with DNS resolution in the chroot when using nixos-enter to try and build a new generation, so I can't try this right away. If/when I can sort that out I'll give this a go.

msfjarvis commented 1 month ago

Reverting #279789 did not resolve the issue for me so I'm doing my own bisect now.

msfjarvis commented 1 month ago

Some rather painful bisection later it appears my issue is not related to this after all, and is instead caused by a GNOME extension update from https://github.com/NixOS/nixpkgs/pull/325257. I'll direct my investigations there, apologies for the noise 🙇‍♂️

nazarewk commented 1 month ago

> Some rather painful bisection later it appears my issue is not related to this after all, and is instead caused by a GNOME extension update from #325257. I'll direct my investigations there, apologies for the noise 🙇‍♂️

I would not hide this comment :)

thibautbenjamin commented 1 month ago

> Sorry for the confusion: the above might be the reason for the 24.05 issues, since 24.05 does not have #279789 merged in, but this issue mentions 24.11, so it is most likely on unstable.
>
> I tried to gather some discussion giving a clue on what might be happening and why, but I'm not the one who found the "fix":
>
> * [nixos/hardware.display: init module #279789 (comment)](https://github.com/NixOS/nixpkgs/pull/279789#issuecomment-2106205288)
>
> * [nixos/hardware.display: init module #279789 (comment)](https://github.com/NixOS/nixpkgs/pull/279789#issuecomment-2148560802)
>
> * https://discourse.nixos.org/t/copying-custom-edid/31593/32
>
> CC @jys1670 , who came up with the FW_LOADER fix

Thanks a lot for your quick reply. I stumbled upon the fix later during the weekend but did not have time to build the full kernel myself. I'm doing that now and will let you know whether it solves my issue.

nazarewk commented 1 month ago

> Thanks a lot for your quick reply. I stumbled upon the fix later during the weekend but did not have time to build the full kernel myself. Doing that just now and will let you know if this solves my issues

I'm pretty sure you already have the fix, because it is part of nixos-unstable now and by the sound of it your issues are caused by it.

thibautbenjamin commented 1 month ago

Sorry, I think you misunderstood my issue a little bit: my laptop is working fine on any commit before 7c3815a (#279789), but this commit is what makes the issue appear for me.

I just did the kernel rebuild, and it did not solve the issue (expected, if I understand the situation correctly).

However, something weird happened: I changed my config so that I get logged in directly as my user on boot (I was using gdm before). With this change, I noticed that my laptop does in fact boot properly, as my external monitor works and displays my screen as usual.

The screen of my laptop, however, is still completely black and buggy, so it does appear to be a GPU issue. Jumping to a tty does not solve the issue: the primary screen is still bugged out despite not using a graphical environment. But this means that I now have access to the system while it's broken and can display some logs.

I ran `journalctl -xe | grep error`, which only gave me the following output, which I believe is unrelated:

Jul 29 20:16:52 nixos udiskie[3399]: gi.repository.GLib.GError: g-dbus-error-quark: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.UDisks2 was not provided by any .service files (2)

I had previously noticed an EDID-related error in journalctl, which seems to have disappeared with the aforementioned fix. However, my laptop was working fine while that error was present (before the fix) and is not working with the fix, despite the error message being gone.

I don't have much experience dealing with this stuff, but I would be happy to help.

(I am guessing that as a temporary workaround I can probably use the suggested fix and replace the yes by a no, so I'll try that now, given that I can quickly switch to the version that is broken on my end, as it is cached.)

nazarewk commented 1 month ago

> However, something weird happened: I changed my config so that I get logged in directly as my user on boot (I was using gdm before). With this change, I noticed that my laptop does in fact boot properly, as my external monitor works and displays my screen as usual.
>
> The screen of my laptop, however, is still completely black and buggy, so it does appear to be a GPU issue. Jumping to a tty does not solve the issue: the primary screen is still bugged out despite not using a graphical environment. But this means that I now have access to the system while it's broken and can display some logs.
>
> I ran `journalctl -xe | grep error`, which only gave me the following output, which I believe is unrelated:
>
> Jul 29 20:16:52 nixos udiskie[3399]: gi.repository.GLib.GError: g-dbus-error-quark: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.UDisks2 was not provided by any .service files (2)

Could it be https://github.com/NixOS/nixpkgs/issues/330379#issuecomment-2256550303 ?

thibautbenjamin commented 1 month ago

I don't think this is the same issue, since it persists even without gdm.

Also, I did a kernel rebuild reverting the fix, and it works now. I have this in my config:

  boot = {
    kernelPatches = [ {
      name = "edid-loader-fix-config";
      # No source patch, only a kernel config override.
      patch = null;
      extraConfig = ''
        FW_LOADER m
      '';
    } ];
  };

and it fixes my issue.

This is not the best solution, as it forces me to rebuild the kernel, but I can live with it as a temporary fix. I'm still happy to help with any report or feedback I can provide.
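For comparison, the same override can probably also be written with structured kernel config instead of a free-form string. This is only a sketch under assumptions: it relies on the `extraStructuredConfig` attribute of `boot.kernelPatches` entries and on `lib.kernel.module`, and the `lib.mkForce` is there on the assumption that nixpkgs' common kernel config already defines `FW_LOADER` and would otherwise conflict. It has not been tested on the affected laptop.

```nix
{ lib, ... }:
{
  boot.kernelPatches = [
    {
      name = "edid-loader-fix-config";
      # No source patch, only a kernel config override.
      patch = null;
      extraStructuredConfig = {
        # Build the firmware loader as a module (m).
        # mkForce is an assumption: it is meant to win over an existing
        # FW_LOADER definition in the default kernel configuration.
        FW_LOADER = lib.mkForce lib.kernel.module;
      };
    }
  ];
}
```

Either form still triggers a full kernel rebuild, so this changes how the override is expressed, not how long it takes.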

jys1670 commented 1 month ago

I found the FW_LOADER solution by comparing the NixOS and Arch kernel configurations (and plenty of rebuilds 🥲). Arch has FW_LOADER set to yes, so this issue is probably about NixOS lacking something in its config. Could you try a kernel with an independent configuration, like linux_cachyos from Nyx? Make sure to add the cache first to avoid rebuilds. Also, Arch carries some special patches for nvidia: https://github.com/NixOS/nixpkgs/pull/279789#issuecomment-2108672195. They could be related. Either way, I can't really help much here, since I don't have an amd + nvidia machine, and my intel + nvidia muxless laptop works fine (both sync and offload modes with the proprietary driver on KDE Plasma).
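A rough sketch of what trying that kernel could look like, assuming a flake-based setup and the Nyx flake's usual entry points (`github:chaotic-cx/nyx/nyxpkgs-unstable`, `nixosModules.default`, `pkgs.linuxPackages_cachyos`). The host name `laptop` and `./configuration.nix` are placeholders. Please check the names and the cache setup against the Nyx README before relying on this.

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    # Assumed location of the Nyx flake.
    chaotic.url = "github:chaotic-cx/nyx/nyxpkgs-unstable";
  };

  outputs = { self, nixpkgs, chaotic }: {
    # "laptop" is a hypothetical host name.
    nixosConfigurations.laptop = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        # Provides the Nyx packages (overlay) to the system.
        chaotic.nixosModules.default
        ({ pkgs, ... }: {
          # Swap the NixOS default kernel for the CachyOS one.
          boot.kernelPackages = pkgs.linuxPackages_cachyos;
        })
      ];
    };
  };
}
```

To avoid compiling the kernel locally, it is likely safest to rebuild once with only the Nyx module added (so the binary cache is configured) and to switch `boot.kernelPackages` in a second rebuild.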

thibautbenjamin commented 1 month ago

Thanks for your suggestion. I'll test that during the weekend, when I have time.

thibautbenjamin commented 1 month ago

Ok, I figured it out! I use the nixos-hardware repo to get all the config specific to my laptop, and until now I was accidentally using a slightly different profile (the one for the generation just after mine, I think). It turns out that this profile sets an EDID file that is required for that other model but messes up my system.

Because of the earlier issue that prevented EDID files from loading, this was not a problem until now: the faulty EDID file was simply never loaded. Fixing that issue had the side effect of loading the faulty EDID file on my machine, which caused the glitches.

So in the end, I fixed the mistake in my config and now pull the profile that matches the hardware I have, and that solved the issue. I'll mark the issue as resolved.
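For anyone hitting the same thing, here is a minimal sketch of where the profile choice lives, assuming nixos-hardware is consumed as a flake input. The module name below is a deliberate placeholder (the exact models involved are not named in this issue), and `laptop` and `./configuration.nix` are hypothetical as well.

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    nixos-hardware.url = "github:NixOS/nixos-hardware";
  };

  outputs = { self, nixpkgs, nixos-hardware }: {
    # "laptop" is a hypothetical host name.
    nixosConfigurations.laptop = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        # Placeholder: use the module for your exact model and generation.
        # A profile for a neighbouring generation may ship an EDID file
        # that does not match your panel.
        nixos-hardware.nixosModules."REPLACE-WITH-EXACT-MODEL"
      ];
    };
  };
}
```

If the laptop's internal panel glitches after an update, comparing this line against the model listed in the nixos-hardware README is a cheap first check.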