NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

Radeon graphics ring stalled issue #39573

Open elaOnMars opened 6 years ago

elaOnMars commented 6 years ago

Issue description

A nearly new NixOS system produces Radeon graphics ring 0 stalls in tty1. I have installed only a few packages and adjusted the default configuration.nix slightly.

Maybe a missing driver?

...
[72636.650438] radeon 0000:01:00.0: ring 0 stalled for more than 15551020msec
[73618.503749] radeon 0000:01:00.0: ring 0 stalled for more than 16532828msec
[73619.007766] radeon 0000:01:00.0: ring 0 stalled for more than 16533332msec
[73619.511789] radeon 0000:01:00.0: ring 0 stalled for more than 16533836msec
[73620.015807] radeon 0000:01:00.0: ring 0 stalled for more than 16534340msec
[73620.519850] radeon 0000:01:00.0: ring 0 stalled for more than 16534844msec
...

UPDATE: This also happens on Arch Linux, as I observed yesterday.

Steps to reproduce

I don't know how to trigger those errors.

Technical details

Related issues?

https://github.com/NixOS/nixpkgs/issues/30188 (closed)
https://github.com/NixOS/nixpkgs/issues/31154 (closed)

vcunat commented 6 years ago

It might be worth trying the "amdgpu" driver, especially if it's newer hardware.

elaOnMars commented 6 years ago

The problem occurred after resuming. The following messages appeared between the stall errors:

drm: r600_ring_test [radeon]
drm: r600_resume [radeon]

On the Arch Linux wiki I read that "with the radeon driver, power saving is disabled by default and has to be enabled manually if desired." [https://wiki.archlinux.org/index.php/ATI]

With this in mind, installing the "amdgpu" driver would probably not resolve the issue. As I understand the Arch Linux wiki, radeon.dpm=1 must be added to the kernel parameters too.

[[Update: Adding radeon.dpm=1 to the kernelParams in hardware-configuration.nix has produced a black screen. --> not the solution]]

I'm unsure how to add those parameters to the nix configuration and hardware files.

To check the power saving settings of the card, use:

$ cat  /sys/class/drm/card0/device/power_profile
default

As documented here [https://www.x.org/wiki/RadeonFeature/#index3h2], "'default' uses the default clocks and does not change the power state. This is the default behavior."

I'm not sure how to change the content of /sys/class/drm/card0/device/power_profile by default with the nix files.
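One possible way to set this from configuration.nix is a systemd-tmpfiles rule that writes a value into the sysfs file at boot. This is a minimal, untested sketch: it assumes the Radeon card is card0, that the power_profile file exists on the running kernel (it belongs to the profile-based power management, not to DPM), and that the DRM device is already present when systemd-tmpfiles runs. The choice of "low" is only illustrative; the X.org page cited above lists "default", "auto", "low", "mid" and "high".

```nix
{
  systemd.tmpfiles.rules = [
    # tmpfiles.d "w" entry: write the argument ("low") into an existing file.
    # Assumption: card0 is the Radeon card and power_profile exists.
    "w /sys/class/drm/card0/device/power_profile - - - - low"
  ];
}
```

After a rebuild and reboot, reading the file again with cat should show whether the write took effect.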

vcunat commented 6 years ago

Extra kernel parameters can be put into boot.kernelParams in your configuration.nix.
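As a rough illustration of that suggestion, the parameter discussed in this thread could be added as shown below. Note that the reporter found radeon.dpm=1 produced a black screen on their hardware, so treat the value as an example of the syntax, not as a recommended setting.

```nix
{
  # Entries are appended to the kernel command line as-is.
  # "radeon.dpm=1" is the parameter from the discussion above;
  # it led to a black screen on the reporter's machine.
  boot.kernelParams = [ "radeon.dpm=1" ];
}
```

After nixos-rebuild switch and a reboot, the active command line can be checked in /proc/cmdline.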

vcunat commented 6 years ago

amdgpu is just a different driver (newer design), so it has a somewhat different set of issues.
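A sketch of what switching to amdgpu might look like in configuration.nix follows. It rests on assumptions not stated in the thread: that the card belongs to the Sea Islands (CIK) family, for which the kernel needs to be told to hand the device from radeon to amdgpu, and that X11 is in use. Southern Islands cards would use the si_support parameters instead, and newer cards need no such parameters at all.

```nix
{
  # Load amdgpu in the initrd so it can claim the GPU early.
  boot.initrd.kernelModules = [ "amdgpu" ];

  # Assumption: a Sea Islands (CIK) GPU. For Southern Islands, use
  # radeon.si_support=0 and amdgpu.si_support=1 instead.
  boot.kernelParams = [
    "radeon.cik_support=0"
    "amdgpu.cik_support=1"
  ];

  # Select the amdgpu X11 driver (xf86-video-amdgpu).
  services.xserver.videoDrivers = [ "amdgpu" ];
}
```

Which kernel driver actually bound to the card can be checked with lspci -k after rebooting.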

elaOnMars commented 6 years ago

vcunat: It seems that installing xf86-video-amdgpu may have fixed this issue... I'll keep monitoring it, because the "ringing" occurred late and only after a couple of reboots.

I've edited my text above regarding the idea of adding radeon.dpm=1 to the kernelParams, which produced a black screen.

stale[bot] commented 4 years ago

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

gepbird commented 3 weeks ago

Still relevant. I've been experiencing this for months with the default drivers on an AMD FX-7500 (Radeon R7) running nixos-unstable.

Usually I get this after resuming from hibernation, but recently I got it while actively using the device.

[ 1276.432226] radeon 0000:00:01.0: ring 0 stalled for more than 10271msec
[ 1276.432248] radeon 0000:00:01.0: GPU lockup (current fence id 0x0000000000026a51 last fence id 0x0000000000026a93 on ring 0)


I'm going to try the amdgpu driver.