Closed davidak closed 1 year ago
Why does it say 6.2.0 in your log? Are you actually running 6.2.6? Please confirm with uname -a.
I am running 6.2.6 now, but before the reboot it was 6.2.0 (NixOS 22.11.2999.a7cc81913bb).
It could already be fixed in that version. I'll keep an eye on it.
Had a look at the log, that's a known issue. You have too little free RAM to suspend, as suspending means evacuating all VRAM into system memory after the kernel allocator is already in NOIO mode and thus cannot swap any further. I worked around this with https://git.dolansoft.org/lorenz/memreserver which gets executed before sleep and forces the kernel to keep enough free memory around for suspend to succeed. Adjust the amount of memory to your card's VRAM + 1 GiB.
I should make this a NixOS module at some point.
The crash is due to the fact that amdgpu aborts the suspend, which leads to the kernel attempting s2idle, which is not properly supported on this platform with an AMD GPU, leading to the SMU failing.
that's a known issue. You have too little free RAM to suspend as suspending means evacuating all VRAM into system memory
i have 32 GB RAM, 32 GB SWAP and 8 GB VRAM. i see how that can be an issue in this context when RAM and VRAM are filled
i had reported a similar issue before and there were multiple fixes
https://gitlab.freedesktop.org/drm/amd/-/issues/2223
https://github.com/torvalds/linux/commit/8d4de331f1b24a22d18e3c6116aa25228cf54854 (in 6.1)
https://github.com/systemd/systemd/issues/25151 is still open
AMD did fix the case where merely attempting to go to S3 resulted in a GPU reset after TTM failed to evacuate the GPU VRAM; instead, it now aborts the S3 suspend attempt. But it still isn't able to suspend under memory pressure with a GPU with external VRAM because of some annoying design limitations in the PM subsystem (namely that there are no subsystem constraints and no phases, this has also bitten me on the storage/SCSI side). These limitations mean that it cannot perform writeback or swapping at the point where the VRAM eviction happens. Thus even with a lot of swap or nominally free memory being used as cache you end up with this issue. I'm running 64GiB RAM/96GiB swap and still had the same problem.
systemd just adds fuel to the fire by then attempting to go into s2idle, which on AMD does a bunch of work that can result in issues on systems which aren't expected to go into s2idle, but on paper this behavior is acceptable. Crashing/hanging the GPU SMU by doing things to it that are not supported on the platform is technically on AMD.
I've been running my workaround for more than two years and never had any issues again. It essentially just installs a sleep hook which runs before the kernel actually suspends devices. The hook allocates a bit more memory than the GPU has VRAM, forces the kernel to actually back it with real RAM (by locking it and writing zeroes to it) and then terminates. If the kernel cannot find real free RAM to back this allocation, it has to clear caches, do writebacks or swap out memory pages while the system is still under normal operation. Once the hook exits, it leaves behind a large amount of truly free RAM which the kernel can then immediately use to evacuate VRAM into.
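For illustration, the allocate / lock / zero / exit sequence described above could look roughly like the following Python sketch. This is not the actual memreserver code (which is written in C); `reserve_and_release` is a hypothetical name, and falling back to merely touching the pages when `mlock` is refused (for example because of RLIMIT_MEMLOCK) is an assumption made for this sketch.

```python
import ctypes
import ctypes.util
import mmap


def reserve_and_release(num_bytes: int) -> bool:
    """Allocate num_bytes, force real RAM behind them, then free them again.

    Returns True if mlock() succeeded, False if the pages were only touched.
    Hypothetical helper, not part of memreserver.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    # Anonymous private mapping: pages get no physical backing until touched.
    buf = mmap.mmap(-1, num_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    anchor = ctypes.c_char.from_buffer(buf)  # only used to obtain the address
    try:
        addr = ctypes.addressof(anchor)
        # Locking asks the kernel to keep the whole range resident in RAM.
        locked = libc.mlock(addr, num_bytes) == 0
        # Writing one byte per page forces every page to be really allocated.
        for offset in range(0, num_bytes, mmap.PAGESIZE):
            buf[offset] = 0
        if locked:
            libc.munlock(addr, num_bytes)
    finally:
        del anchor   # release the buffer export so the mapping can be closed
        buf.close()  # the freed pages are now genuinely free RAM
    return locked
```

In a real sleep hook, `num_bytes` would be the VRAM size plus some headroom, and the process would need enough privileges (or a high enough memlock limit) for the lock to succeed.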
I'm thinking of spending some time to make this nice, i.e. a proper module which detects all GPUs with onboard VRAM (it's not needed for most APUs/notebook GPUs which share RAM), sums the amount, adds a fudge factor and then performs this without the need to manually configure anything.
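As a rough sketch of the "sum the VRAM, add a fudge factor" idea: amdgpu exposes the VRAM size in bytes via the sysfs file `mem_info_vram_total`, so the amount could be computed as below. The function names and the 1 GiB default fudge factor are assumptions for illustration, reading sysfs is a swapped-in technique rather than how memreserver does it, and this sketch does not distinguish APUs (which also report a VRAM carve-out) from GPUs with onboard VRAM.

```python
from pathlib import Path

GIB = 1024 ** 3


def total_vram_bytes(sysfs_root: str = "/sys/class/drm") -> int:
    """Sum mem_info_vram_total over all DRM cards that expose it (amdgpu)."""
    total = 0
    for card in sorted(Path(sysfs_root).glob("card[0-9]*")):
        if "-" in card.name:
            continue  # skip connector entries such as card0-DP-1
        vram_file = card / "device" / "mem_info_vram_total"
        try:
            total += int(vram_file.read_text().strip())
        except (OSError, ValueError):
            continue  # not an amdgpu device, or unreadable value
    return total


def bytes_to_reserve(vram_bytes: int, fudge_bytes: int = GIB) -> int:
    """VRAM plus headroom; nothing to reserve if no VRAM was found."""
    return 0 if vram_bytes == 0 else vram_bytes + fudge_bytes
```

For an 8 GiB card this yields 9 GiB, matching the "VRAM + 1 GiB" rule of thumb given earlier in the thread.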
I have an ancient desktop with an AMD dGPU (8GB VRAM) and a laptop with a ryzen 2 CPU (512M VRAM). I have experienced the black screen problem a lot on the laptop but only once or twice on the desktop.
I tried your memreserver with this package def - very hopeful that it does the trick because it’s annoying AF with the laptop.
{ lib
, stdenv
, fetchFromGitLab
, gigaBytes ? 9
}:

stdenv.mkDerivation rec {
  pname = "memreserver";
  version = "0.0.0.20200414";

  src = fetchFromGitLab {
    domain = "git.dolansoft.org";
    owner = "lorenz";
    repo = pname;
    rev = "094963f0a90a6b059240ecc6fff9aeb8213e64cc";
    hash = "sha256-wLHnOR+lgWFy0IdbQBKKA6HcMLejZHpfScNT9KDfSlw=";
  };

  postPatch = ''
    substituteInPlace Makefile \
      --replace /usr/local $out \
      --replace /etc $out/lib

    substituteInPlace main.c \
      --replace 'amount = 5' 'amount = ${toString gigaBytes}' \
      --replace ' 5G' ' ${toString gigaBytes}G'

    substituteInPlace memreserver.service \
      --replace /usr/local $out \
      --replace ' 5G' ' ${toString gigaBytes}G'
  '';

  preInstall = ''
    mkdir -p $out/{bin,lib/systemd/system}
  '';

  meta = with lib; {
    description = "Reserve memory for AMDGPU VRAM";
  };
}
@peterhoeg AFAIK all Ryzen 2000-series mobile CPUs have an integrated GPU. Unless you also have a separate dedicated GPU on your notebook your issues have a different cause and will not be fixed by memreserver. Integrated GPUs do not have dedicated VRAM but rely on a slice of RAM shared with the CPU which is kept in self-refresh during S3 so it does not need to be evacuated, thus the problem cannot occur there.
So I've done some work on https://git.dolansoft.org/lorenz/memreserver: it now uses libdrm to dynamically determine the amount of RAM to reserve, and it skips the process entirely if no GPU is found that requires it. It still only works for AMD GPUs, as these are the only ones I've personally experienced the problem on. Intel dGPUs are probably also affected, but I don't own any.
Please test this improved version and report any issues. If it works out well, I need to rename it to something better (open to suggestions, I haven't found anything good yet) and make a NixOS module for it. Maybe we could even enable it automatically if amdgpu is in initrd.kernelModules or initrd.availableKernelModules, otherwise people need to know to turn the right knob to not get suspend failures.
EDIT: Here's a draft module, still without default enabling: https://github.com/lorenz/nixpkgs/commit/ff28634eb779b3b73b96812e8e477e7ed1d4a6ad
I have a similar issue, which might be related:
My laptop with Ryzen 6800H & RX 6850M XT fails to suspend.
After I run systemctl suspend or sudo systemctl suspend, my laptop screen goes blank after 0.5 seconds, then it wakes up automatically after 5 seconds & then it goes blank again after 5 seconds. But it still doesn't enter suspend (the power button & other LEDs remain solid instead of blinking).
Curiously, sudo pm-suspend works without any issues. Even more curiously, PM_DEBUG=true sudo pm-suspend has the same behavior as systemctl suspend.
@utkarshgupta137 That's an unrelated kernel issue, please report it to https://gitlab.freedesktop.org/drm/amd
I'm inclined to close this issue since there is nothing we can really do about upstream issues. If it turns out @lorenz' workaround is effective at mitigating this issue, you can create a feature request for implementing it.
I have a similar issue, which might be related: My laptop with Ryzen 6800H & RX 6850M XT fails to suspend. After I run systemctl suspend or sudo systemctl suspend, my laptop screen goes blank after 0.5 seconds, then it wakes up automatically after 5 seconds & then it goes blank again after 5 seconds. But it still doesn't enter suspend (the power button & other LEDs remain solid instead of blinking). Curiously, sudo pm-suspend works without any issues. Even more curiously, PM_DEBUG=true sudo pm-suspend has the same behavior as systemctl suspend.
Here is the dmesg output for systemctl suspend
Here is the dmesg output for sudo pm-suspend
My issue was related to deep sleep in /sys/power/mem_sleep. My laptop didn't have the deep sleep option by default, but I was able to enable it using UMAF. Disabling it again solved the problem.
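To check which suspend variant a machine offers and which one is active, the contents of /sys/power/mem_sleep can be parsed; the kernel lists the available modes separated by spaces and marks the selected one with square brackets (e.g. `s2idle [deep]`). The helper below is a minimal sketch with a hypothetical name, not something from the thread.

```python
from typing import List, Optional, Tuple


def parse_mem_sleep(text: str) -> Tuple[List[str], Optional[str]]:
    """Return (available modes, currently selected mode) from mem_sleep text."""
    available: List[str] = []
    selected: Optional[str] = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # brackets mark the active mode
            selected = token
        available.append(token)
    return available, selected
```

On a live system the input would come from reading /sys/power/mem_sleep, for example with `open("/sys/power/mem_sleep").read()`.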
I had the issue again today on NixOS 22.11.4479.d4a9ff82fc1 with Linux 6.3.5 and created a bug report upstream: https://gitlab.freedesktop.org/drm/amd/-/issues/2635
Still happens with Linux 6.7.6. New upstream issue: https://gitlab.freedesktop.org/drm/amd/-/issues/3208 I have configured memreserver and will see how well it works.
@lorenz Thank you for your workaround, it's working very well for me so far!
@infinisil I'm glad it works well for you too, I should probably take some time to clean up/finish https://github.com/NixOS/nixpkgs/pull/225819 so it can be used out-of-the-box :)
Describe the bug
i suspended the computer yesterday and resumed today by hitting ENTER. the computer is on, but i only see a black screen, not even a cursor
Steps To Reproduce
Steps to reproduce the behavior:
Expected behavior
have image on screen
Screenshots
imagine an all black screenshot
Additional context
Similar to previous issues:
amdgpu crashed the kernel
full system log: amdgpu_crash.txt.zip
probably something upstream (amdgpu) has to fix
Notify maintainers
Metadata
"x86_64-linux"
Linux 6.2.6, NixOS, 22.11 (Raccoon), 22.11.3196.cd34d6ed7ba
yes
yes
nix-env (Nix) 2.11.1
"home-manager-22.11.tar.gz, nixos-22.11, nixos-hardware, nixos-unstable"
/nix/var/nix/profiles/per-user/root/channels/nixos