ValveSoftware / SteamOS

SteamOS community tracker

amdgpu crash with 3.5.11 #1312

Open unclejack opened 9 months ago

unclejack commented 9 months ago

Your system information

Please describe your issue in as much detail as possible:

I expected gamescope and the GPU driver not to crash.

What happened:

Dec 00 00:44:53 steamdeck fancontrol.py[577]: Warning: CPU temperature of 94.0 greater than max 90! Setting fan to max speed.
Dec 00 00:44:54 steamdeck fancontrol.py[577]: Warning: CPU temperature of 92.2 greater than max 90! Setting fan to max speed.
Dec 00 00:44:55 steamdeck fancontrol.py[577]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 00 00:45:21 steamdeck fancontrol.py[577]: Warning: CPU temperature of 90.6 greater than max 90! Setting fan to max speed.
Dec 00 00:45:45 steamdeck dbus-daemon[572]: [system] Activating via systemd: service name='org.freedesktop.home1' unit='dbus-org.freedesktop.home1.service' requested by ':1.172' (uid=0 pid=5783 comm="sudo -s")
Dec 00 00:45:45 steamdeck dbus-daemon[572]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 00 00:48:19 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Dec 00 00:48:19 steamdeck kernel: [drm] PSP is resuming...
Dec 00 00:48:19 steamdeck (udev-worker)[5882]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4785' failed with exit code 1.
Dec 00 00:48:19 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Dec 00 00:48:19 steamdeck fancontrol.py[577]: Traceback (most recent call last):
Dec 00 00:48:19 steamdeck fancontrol.py[577]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in <module>
Dec 00 00:48:19 steamdeck fancontrol.py[577]:     controller.loop_control()
Dec 00 00:48:19 steamdeck fancontrol.py[577]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control
Dec 00 00:48:19 steamdeck fancontrol.py[577]:     self.loop_read_sensors()
Dec 00 00:48:19 steamdeck fancontrol.py[577]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors
Dec 00 00:48:19 steamdeck fancontrol.py[577]:     self.power_sensor.get_avg_value()
Dec 00 00:48:19 steamdeck fancontrol.py[577]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value
Dec 00 00:48:19 steamdeck fancontrol.py[577]:     self.values.append(self.get_value())
Dec 00 00:48:19 steamdeck fancontrol.py[577]:                        ^^^^^^^^^^^^^^^^
Dec 00 00:48:19 steamdeck fancontrol.py[577]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value
Dec 00 00:48:19 steamdeck fancontrol.py[577]:     self.value = int(f.read().strip()) / 1000000
Dec 00 00:48:19 steamdeck fancontrol.py[577]:                      ^^^^^^^^
Dec 00 00:48:19 steamdeck fancontrol.py[577]: PermissionError: [Errno 1] Operation not permitted
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE
Dec 00 00:48:19 steamdeck fancontrol.py[5887]: loaded critical temp from SSD hwmon: 79.85
Dec 00 00:48:19 steamdeck fancontrol.py[5887]: returning fan to EC control loop
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'.
Dec 00 00:48:19 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 15.142s CPU time.
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Dec 00 00:48:20 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A
Dec 00 00:48:20 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 00 00:48:20 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0
Dec 00 00:48:20 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Dec 00 00:48:20 steamdeck kernel: [drm] JPEG decode initialized successfully.
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Dec 00 00:48:20 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
Dec 00 00:48:20 steamdeck systemd[1]: jupiter-fan-control.service: Scheduled restart job, restart counter is at 1.
Dec 00 00:48:20 steamdeck systemd[1]: Stopped Jupiter fan control.
Dec 00 00:48:20 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 15.142s CPU time.
Dec 00 00:48:20 steamdeck systemd[1]: Started Jupiter fan control.
Dec 00 00:48:21 steamdeck fancontrol.py[5897]: loaded critical temp from SSD hwmon: 79.85
Dec 00 00:48:21 steamdeck fancontrol.py[5897]: jupiter-fan-control started successfully.
Dec 00 00:49:25 steamdeck dbus-daemon[572]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.174' (uid=1000 pid=5944 comm="timedatectl status")
Dec 00 00:49:25 steamdeck systemd[1]: Starting Time & Date Service...
Dec 00 00:49:25 steamdeck dbus-daemon[572]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 00 00:49:25 steamdeck systemd[1]: Started Time & Date Service.
Dec 00 00:49:35 steamdeck systemd[1]: Created slice Slice /system/systemd-coredump.
Dec 00 00:49:35 steamdeck systemd[1]: Started Process Core Dump (PID 5980/UID 0).
Dec 00 00:49:35 steamdeck core_handler[5981]: Minidump generated at /var/lib/steamos-log-submitter/pending/minidump/.staging-1702676974-gamescope-xwm-3325-None.dmp
Dec 00 00:49:35 steamdeck kernel: input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input35
Dec 00 00:49:35 steamdeck systemd-coredump[5982]: Process 3325 (gamescope-wl) of user 1000 dumped core.

                                                  Stack trace of thread 3360:
                                                  #0  0x00007f9d5589f26c n/a (libc.so.6 + 0x8926c)
                                                  #1  0x00007f9d5584fa08 raise (libc.so.6 + 0x39a08)
                                                  #2  0x00007f9d55838538 abort (libc.so.6 + 0x22538)
                                                  #3  0x00007f9d5583845c n/a (libc.so.6 + 0x2245c)
                                                  #4  0x00007f9d558483d6 __assert_fail (libc.so.6 + 0x323d6)
                                                  #5  0x0000561db0f8cd97 n/a (gamescope + 0x7fd97)
                                                  #6  0x0000561db0f960ca n/a (gamescope + 0x890ca)
                                                  #7  0x0000561db0f658a0 n/a (gamescope + 0x588a0)
                                                  #8  0x0000561db0f67b3f n/a (gamescope + 0x5ab3f)
                                                  #9  0x0000561db0f82fac n/a (gamescope + 0x75fac)
                                                  #10 0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #11 0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #12 0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3325:
                                                  #0  0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
                                                  #1  0x0000561db0f8555f n/a (gamescope + 0x7855f)
                                                  #2  0x0000561db0f2f446 n/a (gamescope + 0x22446)
                                                  #3  0x00007f9d55839850 n/a (libc.so.6 + 0x23850)
                                                  #4  0x00007f9d5583990a __libc_start_main (libc.so.6 + 0x2390a)
                                                  #5  0x0000561db0f51555 n/a (gamescope + 0x44555)

                                                  Stack trace of thread 3326:
                                                  #0  0x00007f9d55921266 epoll_wait (libc.so.6 + 0x10b266)
                                                  #1  0x0000561db0f73bcf n/a (gamescope + 0x66bcf)
                                                  #2  0x0000561db0f77424 n/a (gamescope + 0x6a424)
                                                  #3  0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #4  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #5  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3328:
                                                  #0  0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
                                                  #1  0x0000561db0f84987 n/a (gamescope + 0x77987)
                                                  #2  0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #3  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #4  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3330:
                                                  #0  0x00007f9d558e59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
                                                  #1  0x00007f9d558ea5e7 __nanosleep (libc.so.6 + 0xd45e7)
                                                  #2  0x00007f9d54100455 n/a (libvulkan_radeon.so + 0x100455)
                                                  #3  0x00007f9d5425c7cc n/a (libvulkan_radeon.so + 0x25c7cc)
                                                  #4  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #5  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3359:
                                                  #0  0x00007f9d55913c0f __poll (libc.so.6 + 0xfdc0f)
                                                  #1  0x0000561db0fa99b2 n/a (gamescope + 0x9c9b2)
                                                  #2  0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #3  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #4  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3362:
                                                  #0  0x00007f9d558e59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
                                                  #1  0x00007f9d558ea5e7 __nanosleep (libc.so.6 + 0xd45e7)
                                                  #2  0x0000561db0f85037 n/a (gamescope + 0x78037)
                                                  #3  0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #4  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #5  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3358:
                                                  #0  0x00007f9d55921266 epoll_wait (libc.so.6 + 0x10b266)
                                                  #1  0x00007f9d48148579 n/a (libspa-support.so + 0x13579)
                                                  #2  0x00007f9d4813bbe3 n/a (libspa-support.so + 0x6be3)
                                                  #3  0x00007f9d55eb026f n/a (libpipewire-0.3.so.0 + 0x4126f)
                                                  #4  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #5  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3329:
                                                  #0  0x00007f9d55899f0e n/a (libc.so.6 + 0x83f0e)
                                                  #1  0x00007f9d5589c7a0 pthread_cond_wait (libc.so.6 + 0x867a0)
                                                  #2  0x00007f9d5425c89e n/a (libvulkan_radeon.so + 0x25c89e)
                                                  #3  0x00007f9d54239e0c n/a (libvulkan_radeon.so + 0x239e0c)
                                                  #4  0x00007f9d5425c7cc n/a (libvulkan_radeon.so + 0x25c7cc)
                                                  #5  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #6  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)

                                                  Stack trace of thread 3361:
                                                  #0  0x00007f9d5590f900 __open64 (libc.so.6 + 0xf9900)
                                                  #1  0x0000561db0f5dbe5 n/a (gamescope + 0x50be5)
                                                  #2  0x00007f9d55ae1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
                                                  #3  0x00007f9d5589d44b n/a (libc.so.6 + 0x8744b)
                                                  #4  0x00007f9d55920e40 n/a (libc.so.6 + 0x10ae40)
                                                  ELF object binary architecture: AMD x86-64
Dec 00 00:49:35 steamdeck systemd[1]: systemd-coredump@0-5980-0.service: Deactivated successfully.

I'll retrieve the dumps and attach them.

Steps for reproducing this issue:

  1. Play a game for a while on 3.5.11
  2. See the driver crash at some point with the image freezing on the screen
  3. The screen goes black after a while and the frozen image returns
  4. Gamescope recovers somewhat with the steam menu being visible under the frozen image
  5. Crashes again to a black screen
  6. The gamescope session restarts after a timeout

unclejack commented 9 months ago

These are the gamescope minidumps:

1702672557-gamescope-xwm-942-None.dmp
1702673110-gamescope-xwm-3848-None.dmp
1702676974-gamescope-xwm-3325-None.dmp
1702586226-gamescope-xwm-1007-None.dmp

RodoMa92 commented 9 months ago

Might be related:

https://gitlab.freedesktop.org/drm/amd/-/issues/2220

Long standing power management issue in AMD video drivers.

unclejack commented 9 months ago

The amdgpu driver will most likely be fixed if this problem can be fixed in software.

The logs show that some sensors were reporting temperatures much higher than what is considered normal. This happened while the Steam Deck was connected to its power brick.

I haven't reproduced these failures with the device disconnected from the power brick. To be fair, I haven't had much time since reporting the bug to play on the Steam Deck and reproduce the crashes. Is it temporary, or is it some kind of permanent damage? I couldn't say.

If the temperatures are this high when the ambient temperature is just 21-22 degrees Celsius, I don't want to think how it'll behave in the summer.

Given the failures seen in the logs, I wonder how Valve does testing. Are unit tests, integration tests, and end-to-end tests run for the software components that run on SteamOS? I ask because I know a tool was used to test the GPU drivers a while ago. Perhaps similar testing can be done for other components (software or hardware), or is already being done.

RodoMa92 commented 9 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1862800940

Saw the temp warning, but I would not say it's the cause of your crash, to be honest. The Deck APU should be rated for 105 °C max, so you are still well within the limit. If it were actually overheating, it would just hard shut down to protect itself.
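
For anyone who wants to check the margin on their own logs, here is a rough sketch that pulls the peak value out of the fancontrol.py warnings and compares it with the 105 °C figure mentioned above. The function names are made up for illustration, and the 105 °C default is only the number quoted in this comment, not a value read from the hardware.

```python
import re

# Matches the fancontrol.py warning lines quoted in this issue.
WARNING_RE = re.compile(
    r"CPU temperature of (\d+(?:\.\d+)?) greater than max (\d+(?:\.\d+)?)"
)


def peak_temperature(log_text):
    """Return the highest CPU temperature found in fancontrol warnings, or None."""
    temps = [float(m.group(1)) for m in WARNING_RE.finditer(log_text)]
    return max(temps) if temps else None


def headroom(log_text, rated_max=105.0):
    """Degrees Celsius between the peak logged temperature and rated_max.

    rated_max is an assumption taken from the comment above, not a measured limit.
    """
    peak = peak_temperature(log_text)
    return None if peak is None else rated_max - peak


sample = """\
Dec 23 23:01:12 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.4 greater than max 90! Setting fan to max speed.
Dec 23 23:01:13 steamdeck fancontrol.py[556]: Warning: CPU temperature of 97.0 greater than max 90! Setting fan to max speed.
"""

print(peak_temperature(sample), headroom(sample))
```

Feeding it the full journal output instead of `sample` gives the peak for the whole session.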

I would blame the driver long before the actual hardware, especially if it happens in a single title and not across the board.

unclejack commented 9 months ago

I'll post an update with the logs and the dumps if it happens again, regardless of the game.

lostgoat commented 8 months ago

@unclejack there is most likely another error earlier in the logs.

The errors you posted currently occur on 3.5 after a GPU reset. Gamescope and fan control need to handle GPU resets better, but those errors are probably unrelated to the root cause of the issue, which is the game or GPU driver submitting a command that hangs the GPU.
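
To illustrate what "handle resets better" could look like on the fan control side, here is a minimal sketch of a sensor read that tolerates the transient PermissionError seen in the traceback. This is not the actual fancontrol.py code: `read_microunits`, its parameters, and the retry policy are hypothetical, and only the `int(f.read().strip()) / 1000000` expression comes from the posted traceback.

```python
import time


def read_microunits(path, retries=3, delay=0.5, fallback=None):
    """Read an integer from a sysfs-style file and scale it by 1e6.

    Transient read failures (such as the PermissionError raised while the
    GPU is resetting) are retried; `fallback` is returned if every attempt fails.
    """
    for attempt in range(retries):
        try:
            with open(path) as f:
                return int(f.read().strip()) / 1000000
        except (OSError, ValueError):
            # PermissionError is a subclass of OSError, so it is caught here.
            if attempt + 1 < retries:
                time.sleep(delay)
    return fallback
```

A caller could pass the last good reading as `fallback` so that one failed read during a reset does not take the whole service down.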

Check for amdgpu driver errors earlier in the log. They will probably have more details.
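
One way to do that without scrolling through pages of fan warnings is sketched below: for every "GPU reset begin!" line it returns the kernel messages that came just before it. The helper name and the 20-line window are arbitrary choices for illustration, and it assumes the journal has been saved as plain text in the same format as the logs in this issue.

```python
def context_before_resets(log_text, window=20):
    """For each 'GPU reset begin!' line, collect the kernel lines preceding it.

    Only the `window` lines before each reset are inspected, and only lines
    logged by the kernel are kept.
    """
    lines = log_text.splitlines()
    results = []
    for i, line in enumerate(lines):
        if "GPU reset begin!" in line:
            start = max(0, i - window)
            preceding = [l for l in lines[start:i] if " kernel: " in l]
            results.append(preceding)
    return results


sample = """\
Dec 00 00:45:45 steamdeck dbus-daemon[572]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175
Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
"""

for group in context_before_resets(sample):
    print("\n".join(group))
```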

Do I recall correctly that you also had issues with your unit earlier in the year?

unclejack commented 8 months ago

@lostgoat: That was a different unit which has gone through RMA.

The amdgpu driver crashed again yesterday. I'll grab the kernel logs using journalctl and the new dumps and post them here. If there's something else I can do to collect logs or get some kind of trace of the commands that run on the GPU, I'd give that a shot. I imagine this is likely a common amdgpu bug, not something specific to the RDNA2 iGPU in the Steam Deck's SoC.
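
For reference, this is roughly how I plan to pull the kernel log for a given boot. It is only a sketch: the helper names are mine, and the boot offset (0 for the current boot, -1 for the previous one) has to be adjusted to whichever boot contained the crash.

```python
import subprocess


def kernel_log_command(boot_offset=0):
    """Build a journalctl command that prints kernel messages for one boot."""
    return ["journalctl", "--dmesg", f"--boot={boot_offset}", "--no-pager"]


def collect_kernel_log(boot_offset=0):
    """Run journalctl and return its output as text (requires journalctl on PATH)."""
    result = subprocess.run(
        kernel_log_command(boot_offset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(" ".join(kernel_log_command(-1)))
```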

unclejack commented 8 months ago

I've found several other such crashes in the logs:

Dec 15 22:44:55 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 15 22:45:09 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 15 22:45:09 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7573, emitted seq=7575
Dec 15 22:45:09 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 15 22:45:09 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 23 22:31:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 94.0 greater than max 90! Setting fan to max speed.
Dec 23 22:31:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.8 greater than max 90! Setting fan to max speed.
Dec 23 22:31:27 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:33:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.2 greater than max 90! Setting fan to max speed.
Dec 23 22:48:50 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:51 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:51 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:48:52 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:52 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:48:53 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:54 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:57 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:58 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:48:59 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:00 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:01 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:02 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:03 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:04 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:05 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:06 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:07 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:08 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:09 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:10 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:11 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:12 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:13 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:14 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:16 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:17 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:31 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:32 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:33 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:34 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:35 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:36 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:37 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:38 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:49:38 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:39 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:40 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:49:41 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:10 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:11 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:12 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:13 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:14 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:15 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:16 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:17 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:18 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:19 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:20 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:21 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:22 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:23 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:24 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:25 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:26 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 22:50:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:50:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.4 greater than max 90! Setting fan to max speed.
Dec 23 22:55:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.6 greater than max 90! Setting fan to max speed.
Dec 23 22:55:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 22:58:58 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 23 22:58:58 steamdeck kernel: perf: interrupt took too long (2546 > 2500), lowering kernel.perf_event_max_sample_rate to 78300
Dec 23 23:01:12 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.4 greater than max 90! Setting fan to max speed.
Dec 23 23:01:13 steamdeck fancontrol.py[556]: Warning: CPU temperature of 97.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:13 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:01:14 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.8 greater than max 90! Setting fan to max speed.
Dec 23 23:01:14 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:01:15 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:16 steamdeck fancontrol.py[556]: Warning: CPU temperature of 95.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:17 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.8 greater than max 90! Setting fan to max speed.
Dec 23 23:01:18 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:19 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:20 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:21 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:22 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:23 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:24 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:25 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:26 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:27 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:28 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:29 steamdeck fancontrol.py[556]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:30 steamdeck fancontrol.py[556]: Warning: CPU temperature of 92.2 greater than max 90! Setting fan to max speed.
Dec 23 23:01:31 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:32 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:33 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:34 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:35 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:36 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:37 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:38 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:39 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:40 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:41 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:42 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:43 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:44 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:45 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:46 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:47 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:48 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:49 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:50 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:51 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:52 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:53 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:54 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:55 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:56 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:57 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:58 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:01:59 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:00 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:01 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:02 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:03 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:04 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:05 steamdeck fancontrol.py[556]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Dec 23 23:02:23 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:24 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:25 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:26 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:27 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:28 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:29 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:30 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:31 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:32 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:33 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:34 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:35 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:36 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:37 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:38 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:02:39 steamdeck vpower[581]: read /sys/class/power_supply/BAT1/current_now: No such device (os error 19)
Dec 23 23:07:13 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=21419, emitted seq=21421
Dec 23 23:07:13 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Dec 23 23:07:13 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 23 23:07:13 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Dec 23 23:07:13 steamdeck kernel: [drm] PSP is resuming...
Dec 23 23:07:13 steamdeck (udev-worker)[3769]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4556' failed with exit code 1.
Dec 23 23:07:13 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Dec 23 23:07:13 steamdeck fancontrol.py[556]: Traceback (most recent call last):
Dec 23 23:07:13 steamdeck fancontrol.py[556]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in <module>
Dec 23 23:07:13 steamdeck fancontrol.py[556]:     controller.loop_control()
Dec 23 23:07:13 steamdeck fancontrol.py[556]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control
Dec 23 23:07:13 steamdeck fancontrol.py[556]:     self.loop_read_sensors()
Dec 23 23:07:13 steamdeck fancontrol.py[556]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors
Dec 23 23:07:13 steamdeck fancontrol.py[556]:     self.power_sensor.get_avg_value()
Dec 23 23:07:13 steamdeck fancontrol.py[556]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value
Dec 23 23:07:13 steamdeck fancontrol.py[556]:     self.values.append(self.get_value())
Dec 23 23:07:13 steamdeck fancontrol.py[556]:                        ^^^^^^^^^^^^^^^^
Dec 23 23:07:13 steamdeck fancontrol.py[556]:   File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value
Dec 23 23:07:13 steamdeck fancontrol.py[556]:     self.value = int(f.read().strip()) / 1000000
Dec 23 23:07:13 steamdeck fancontrol.py[556]:                      ^^^^^^^^
Dec 23 23:07:13 steamdeck fancontrol.py[556]: PermissionError: [Errno 1] Operation not permitted
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE
Dec 23 23:07:13 steamdeck fancontrol.py[3774]: loaded critical temp from SSD hwmon: 79.85
Dec 23 23:07:13 steamdeck fancontrol.py[3774]: returning fan to EC control loop
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'.
Dec 23 23:07:13 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 12.569s CPU time.
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Dec 23 23:07:14 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A
Dec 23 23:07:14 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Dec 23 23:07:14 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0
Dec 23 23:07:14 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Dec 23 23:07:14 steamdeck kernel: [drm] JPEG decode initialized successfully.
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Dec 23 23:07:14 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!

There are no other messages logged between the first and the last logged message for the last crash from yesterday. I'm not sure what's going on with the battery to cause those errors and why the fan control fails the way it does.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1868499713

My guess is that the fan service dies because it can no longer poll GPU temperatures once the driver has died. It should be irrelevant. To me the crash looks like the issue linked above, though. It might be worth including the patch provided by Mario in that thread; it should be easy to backport to 6.1 LTS.
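For reference, the traceback above shows fancontrol.py exiting on an unhandled PermissionError raised while reading a power sensor file during the GPU reset. Below is a minimal sketch of how a read like that could tolerate a transient failure. It is not the actual jupiter-fan-control code: the helper name `read_sensor_watts` and the "fall back to the last good value" policy are assumptions made for illustration. Only the integer read and the division by 1000000 mirror the line quoted in the traceback.

```python
from typing import Optional


def read_sensor_watts(path: str, last_good: Optional[float] = None) -> Optional[float]:
    """Read an integer micro-unit value from a sysfs-style file and scale it.

    Hypothetical helper: instead of raising, it returns `last_good` when the
    read fails, for example while the driver behind the file is being reset.
    """
    try:
        with open(path) as f:
            return int(f.read().strip()) / 1_000_000
    except (OSError, ValueError):
        # PermissionError (seen in the traceback) is a subclass of OSError;
        # ValueError covers an empty or non-numeric read.
        return last_good
```

A caller would keep the previous reading and pass it back in as `last_good`, so a single failed poll during a reset does not take down the whole control loop.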

RodoMa92 commented 8 months ago

These are the two potential patches:

https://lore.kernel.org/amd-gfx/20231208225328.25651-1-alexander.deucher@amd.com/T/#u
https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch

unclejack commented 8 months ago

I see there are multiple patches which are either being prepared or are already committed. Several patches seem to be necessary. There's also a new binary and source kernel package from Valve. I'll give this a few days and maybe try those patches to check if they help.

hrvylein commented 8 months ago

@unclejack can you elaborate on where you found the updated kernel package and how we can check whether the patches are applied there? Since 3.5.x some games aren't playable anymore because my Steam Deck randomly hard locks, and I fear there might be no update/hotfix from Valve in the next few weeks. As I understand it, these patches are only workarounds to mitigate the hard crashes until the underlying problem is found.

RodoMa92 commented 8 months ago

> @unclejack can you elaborate on where you found the updated kernel package and how we can check whether the patches are applied there? Since 3.5.x some games aren't playable anymore because my Steam Deck randomly hard locks, and I fear there might be no update/hotfix from Valve in the next few weeks. As I understand it, these patches are only workarounds to mitigate the hard crashes until the underlying problem is found.

You can find the latest kernel source build packages here: https://gitlab.com/evlaV/jupiter-PKGBUILD/-/tree/master/linux-neptune-61?ref_type=heads

Including the patches in the sources with a .patch filename will pull them in before compilation.

More details here: https://wiki.archlinux.org/title/PKGBUILD

unclejack commented 8 months ago

@hrvylein: That appears to be the kernel included in the latest 3.5.12 preview update. It probably doesn't make sense to try to build from sources to apply these patches if you don't encounter the crashes.

The 3.5.12 preview update also crashed after a while. This particular crash didn't close the game for some reason. I had to kill it by hand, thus taking down gamescope once again after the initial crash.

The odd thing this time was that the crash didn't occur as quickly as it did last time. It also didn't crash the day before. The logs were the same on the kernel side. There were no new details provided.

added later:

[13586.590300] wlan0: associated
[13587.145112] rtw_8822ce 0000:03:00.0: failed to get tx report from firmware
[13722.466653] wlan0: disconnect from AP <AP 2.4 GHz> for new auth to <AP 5GHz>
[13722.543243] wlan0: authenticate with <AP 5GHz>
[13723.007507] wlan0: send auth to <AP 5GHz> (try 1/3)
[13723.010844] wlan0: authenticated
[13723.012637] wlan0: associate with <AP 5GHz> (try 1/3)
[13723.017413] wlan0: RX ReassocResp from <AP 5GHz> (capab=0x111 status=0 aid=1)
[13723.017773] wlan0: associated
[13723.062862] wlan0: Limiting TX power to 17 (20 - 3) dBm as advertised by <AP 5GHz>
[13729.230559] wlan0: disconnect from AP <AP 5GHz> for new auth to <AP 2.4 GHz>
[13729.393292] wlan0: authenticate with <AP 2.4 GHz>
[13729.393335] wlan0: 80 MHz not supported, disabling VHT
[13729.837556] wlan0: send auth to <AP 2.4 GHz> (try 1/3)
[13729.840968] wlan0: authenticated
[13729.842668] wlan0: associate with <AP 2.4 GHz> (try 1/3)
[13729.847447] wlan0: RX ReassocResp from <AP 2.4 GHz> (capab=0x431 status=0 aid=3)
[13729.847795] wlan0: associated
[13741.520209] wlan0: disconnect from AP <AP 2.4 GHz> for new auth to <AP 5GHz>
[13741.636713] wlan0: authenticate with <AP 5GHz>
[13742.104284] wlan0: send auth to <AP 5GHz> (try 1/3)
[13742.107619] wlan0: authenticated
[13742.109417] wlan0: associate with <AP 5GHz> (try 1/3)
[13742.114130] wlan0: RX ReassocResp from <AP 5GHz> (capab=0x111 status=0 aid=1)
[13742.114454] wlan0: associated
[13742.212310] wlan0: Limiting TX power to 17 (20 - 3) dBm as advertised by <AP 5GHz>
[13748.303767] wlan0: disconnect from AP <AP 5GHz> for new auth to <AP 2.4 GHz>
[13748.410068] wlan0: authenticate with <AP 2.4 GHz>
[13748.410083] wlan0: 80 MHz not supported, disabling VHT
[13748.782948] rtw_8822ce 0000:03:00.0: failed to do dpk calibration

The "failed to do dpk calibration" and "failed to get tx report from firmware" errors don't look too good. Please let me know if a new ticket should be opened for that.

The Steam Deck crashed again. It didn't recover this time. It just sat with the game on screen frozen.

hrvylein commented 8 months ago

@unclejack I haven't had the time to look into the whole kernel thing, and I haven't built a Linux kernel before. Setting up the build chain is challenging for sure.

What I don't understand is that most games are still perfectly fine on my Steam Deck, while one might crash randomly using Proton GE 8.25 and one with a native Linux build will crash in the first 10 minutes of the game. Other games (verified for Deck; I didn't touch the Proton settings or add anything to the startup command) run for hours and days. Do you encounter the same, or are the crashes evenly distributed with no difference between the games played? I really can't make out whether the device is faulty or it's SteamOS.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1871302156

If you don't have any wireless issues, they should be irrelevant. These also happen often on my unit, and Realtek didn't acknowledge them in another, unrelated wireless issue, so they might just be a red herring.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1872276023

If the crash looks the same as @unclejack's, then yeah, it's a random issue that happens when the GPU is temporarily in the GFXOFF state. It should happen with lighter games. The patches above should help then.

hrvylein commented 8 months ago

In Desktop mode I was presented with a Mesa 23.3.0 refresh update (2 packages and 1 Mesa extra update), but I think Game mode uses 23.1.x, if I'm right?

RodoMa92 commented 8 months ago

> In Desktop mode I was presented with a Mesa 23.3.0 refresh update (2 packages and 1 Mesa extra update), but I think Game mode uses 23.1.x, if I'm right?

Yeah, that's just the Mesa Flatpak version; it's unrelated to the system version. You can check it from the system menu in Game mode.

hrvylein commented 8 months ago

I'm trying to provoke the crash with a native Linux game in Game mode. Where is the crash log stored when not using proton_log?

unclejack commented 8 months ago

@hrvylein: Please keep in mind that I'm not a specialist when it comes to graphics programming, graphics APIs and GPU drivers. Here is my attempt at an explanation. Let's imagine we write a program which we test with some given data. One day another user gives it different input data. The program crashes because we didn't do a thorough job with the code which is supposed to handle all possible valid data.

A game which crashes right away or in 10 minutes in a very deterministic manner is a good thing. This means that there's a bug which can be reproduced and fixes can be tested easily. Some games which crash only after some time may leak memory (memory which they allocate, use, fail to free and they keep allocating more until they crash).

Then there are games which cause a GPU driver crash because the GPU resets. This might be a problem with handling some specific sequence of commands or some other issue in the GPU driver. This last type of issue is much harder to fix and more frustrating. It can be something one encounters every 20-30 minutes or it might be something one encounters once every 2-3 days when playing 1-2 hours every day. It is also very dependent on the game. Some games might run for longer than any human can play them without any issue.

One way to solve such bugs might be to do fuzzing for the graphics APIs. Commands sent to the GPU could be recorded and played back to reproduce the failures. An alternative would be to generate a large number of such valid sequences of commands in an attempt to crash the driver. This has been done already to find CPU bugs and undefined behavior in CPUs.

The different proton versions can carry game specific patches for wine itself, for vkd3d or for dxvk. Some games can also crash on startup due to missing codecs, DLLs or other problems. Once again, deterministic crashes are better than the random ones.

You can retrieve the logs via journalctl for steam and dmesg should give you the kernel's log. The kernel's log will provide details such as the ones I've posted above. This will also be present in journalctl. The kernel logs from journalctl will be mixed with those from many other services.
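Since the kernel lines are mixed with output from other services, a small filter can help pick out just the GPU timeouts. The following is only a sketch: the pattern is derived solely from the amdgpu timeout lines quoted in this thread, and the function name `parse_ring_timeouts` and the `pending` field (the gap between the emitted and signaled sequence numbers) are my own illustrative choices, not part of any tool.

```python
import re
from typing import Dict, Iterable, List, Union

# Matches lines such as:
#   ... *ERROR* ring sdma0 timeout, signaled seq=21419, emitted seq=21421
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines: Iterable[str]) -> List[Dict[str, Union[str, int]]]:
    """Return one record per amdgpu ring timeout found in the given log lines."""
    records: List[Dict[str, Union[str, int]]] = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a ring timeout line
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        records.append(
            {
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                # Gap between the last emitted and last signaled sequence number.
                "pending": emitted - signaled,
            }
        )
    return records
```

The function accepts any iterable of strings, so it can be fed the lines of a saved dmesg or journalctl dump, or `sys.stdin` when the log output is piped into a script.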

@RodoMa92: These logged errors might not be relevant for the wireless issues. I do seem to have issues with Steam thinking it's not connected to the Internet from time to time and one specific game complaining that it's disconnected from the game's servers.

RodoMa92 commented 8 months ago

> ... @RodoMa92: These logged errors might not be relevant for the wireless issues. I do seem to have issues with Steam thinking it's not connected to the Internet from time to time and one specific game complaining that it's disconnected from the game's servers.

What wireless network hardware do you have at home? I randomly had the same symptoms while using an AP with ath10k AC wireless (Archer C7) on OpenWRT. I got too annoyed with Realtek/Valve not acting on it or caring enough, so I swapped the AP for one with an MT6915 AX radio plus OpenWRT for 50 bucks, and I haven't been able to reproduce it since, although my issues have also gone down in frequency since I left SteamOS for a more recent Linux kernel (it was unusable on 6.1 LTS).

Feel free to ping them here: https://bugzilla.kernel.org/show_bug.cgi?id=217782

unclejack commented 8 months ago

@RodoMa92: I had a similar setup with C7 v2. What's the other hardware with the mediatek radio? I might also change it to check if that helps. Realtek wifi chips are really poor choices for devices with wifi. The ath11k found in the OLED Deck also has most of its control buried in its binary firmware blob as far as I know. The mediatek mt76 based devices are likely to be the best choice.

It's unfortunate that the wifi is soldered to the mainboard of the Steam Deck. There's absolutely no straightforward way to replace it. It's only possible to replace it with a soldering gun.

Regarding your ONT, perhaps you can put that in bridge mode to get rid of the ONT's wifi and use your router instead for routing duties.

By the way, you can install SteamOS on an external drive using the recovery image on another computer. That should help with testing.

As for the ticket itself and the GPU driver crashes, it seems fuzzing has been done already to some extent for various drivers. The developers and the people who hack on mesa are probably familiar with all the tools used for fuzzing GPU drivers. Perhaps traces could be collected from games running via vkd3d and dxvk to generate extremely long sequences of commands derived from the traces. The tools I've found may already be superseded by other tools: https://github.com/google/graphicsfuzz.

If anyone has recommendations for what traces to collect, please let me know.

RodoMa92 commented 8 months ago

> @RodoMa92: I had a similar setup with C7 v2. What's the other hardware with the mediatek radio? I might also change it to check if that helps. Realtek wifi chips are really poor choices for devices with wifi. The ath11k found in the OLED Deck also has most of its control buried in its binary firmware blob as far as I know. The mediatek mt76 based devices are likely to be the best choice.

Yeah, Qualcomm is even worse in that regard: the "open source" driver part is basically just a shim over 6 MB of black-box stuff. At least with Realtek it's only 250 KB. The other router is a Broadcom AC router that was provided to me. I wouldn't have believed that MediaTek, of all hardware manufacturers, would be the best choice as far as open source support goes. Feel free to test the C7 and give feedback to Realtek on your results; maybe they'll act on it. I already contacted them, saying that you are the third user with stalls and an Archer C7, so I would guess that the issue is with their firmware.

> It's unfortunate that the wifi is soldered to the mainboard of the Steam Deck. There's absolutely no straightforward way to replace it. It's only possible to replace it with a soldering gun.

Yep, otherwise I would have thrown it in the garbage already.

> Regarding your ONT, perhaps you can put that in bridge mode to get rid of the ONT's wifi and use your router instead for routing duties.

I would avoid the double box if I could, but where I live the ISP is obliged to leave you the choice of router, so they have to provide the ONT-to-Ethernet box themselves for free. The problem is that I would also need a POTS-compatible router, which basically doesn't exist, and then I would still have two separate boxes again.

> As for the ticket itself and the GPU driver crashes, it seems fuzzing has been done already to some extent for various drivers. The developers and the people who hack on mesa are probably familiar with all the tools used for fuzzing GPU drivers. Perhaps traces could be collected from games running via vkd3d and dxvk to generate extremely long sequences of commands derived from the traces. The tools I've found may already be superseded by other tools: https://github.com/google/graphicsfuzz.
>
> If anyone has recommendations for what traces to collect, please let me know.

Fuzzing graphics hardware is hard: given the number of API calls, you can't really do much checking on the actual parameters IIRC, so some crashes/errors are kind of expected if the call itself is malformed. However, the driver should still recover gracefully and not die completely like it does here.

Sunspark-007 commented 8 months ago

I use a C7 V2, one correction I would make here is on what wifi chip you think it uses. It uses Atheros for both bands.

I don't use 5 GHz though, and mine has been on OpenWRT firmware for years now. You could try that.

RodoMa92 commented 8 months ago

For wireless issues, either create a new thread or if it's the LCD feel free to add them here: https://github.com/ValveSoftware/SteamOS/issues/1119

unclejack commented 8 months ago

[   84.072605] [drm] Failed to add display topology, DTM TA is not initialized.
[  104.041788] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=5606, emitted seq=5610
[  104.042370] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  104.042919] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[  104.152539] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[  104.162714] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[  104.163235] [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
[  104.163314] [drm] PSP is resuming...
[  104.185505] [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
[  105.046797] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[  105.047768] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[  105.057815] [drm] DMUB hardware initialized: version=0x0300000A
[  105.133882] [drm] Failed to add display topology, DTM TA is not initialized.
[  105.145944] [drm] kiq ring mec 2 pipe 1 q 0
[  105.148101] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  105.148400] [drm] JPEG decode initialized successfully.
[  105.148405] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  105.148408] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  105.148409] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  105.148410] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  105.148411] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  105.148412] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  105.148413] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  105.148414] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  105.148415] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  105.148416] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  105.148417] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  105.148418] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  105.148419] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[  105.148420] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[  105.148421] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[  105.151616] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
[  105.151618] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
[  105.151629] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
[  141.038201] input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input26

The driver crashed after starting the game from the main Steam Deck screen right after a cold boot. The Steam Deck was stored completely shut down in its case before being turned on. Gamescope crashed and the whole system recovered after a while.

[  236.453625] cs35l41 spi-VLV1776:01: DSP1: Firmware version: 3
[  236.453637] cs35l41 spi-VLV1776:01: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[  236.453656] cs35l41 spi-VLV1776:00: DSP1: Firmware version: 3
[  236.453667] cs35l41 spi-VLV1776:00: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[  236.712319] cs35l41 spi-VLV1776:01: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[  236.712532] cs35l41 spi-VLV1776:00: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[  236.712928] cs35l41 spi-VLV1776:01: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[  236.712934] cs35l41 spi-VLV1776:01: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[  236.712941] cs35l41 spi-VLV1776:01: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[  236.713179] cs35l41 spi-VLV1776:00: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[  236.713186] cs35l41 spi-VLV1776:00: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[  236.713192] cs35l41 spi-VLV1776:00: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[  236.761467] cs35l41 spi-VLV1776:01: DSP1: Legacy support not available
[  236.763073] cs35l41 spi-VLV1776:00: DSP1: Legacy support not available
[  237.327711] cs35l41 spi-VLV1776:00: DSP1: Firmware version: 3
[  237.327721] cs35l41 spi-VLV1776:00: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[  237.479320] cs35l41 spi-VLV1776:00: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[  237.479675] cs35l41 spi-VLV1776:00: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[  237.479682] cs35l41 spi-VLV1776:00: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[  237.479689] cs35l41 spi-VLV1776:00: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[  237.505821] cs35l41 spi-VLV1776:01: DSP1: Firmware version: 3
[  237.505831] cs35l41 spi-VLV1776:01: DSP1: cs35l41-dsp1-spk-prot.wmfw: Fri 02 Apr 2021 21:03:50 W. Europe Daylight Time
[  237.644561] cs35l41 spi-VLV1776:01: DSP1: Firmware: 400a4 vendor: 0x2 v0.33.0, 2 algorithms
[  237.644810] cs35l41 spi-VLV1776:01: DSP1: 0: ID cd v29.53.0 XM@94 YM@e
[  237.644815] cs35l41 spi-VLV1776:01: DSP1: 1: ID f20b v0.0.1 XM@170 YM@0
[  237.644820] cs35l41 spi-VLV1776:01: DSP1: Protection: C:\Users\ocanavan\Desktop\cirrusTune_july2021.bin
[  242.623372] input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb3/3-3/3-3:1.2/0003:28DE:1205.0003/input/input27
[  243.928525] input: Microsoft X-Box 360 pad 0 as /devices/virtual/input/input28
[  244.493639] systemd-gpt-auto-generator[3331]: EFI loader partition unknown, exiting.
[  244.493646] systemd-gpt-auto-generator[3331]: (The boot loader did not set EFI variable LoaderDevicePartUUID.)
[  244.982187] systemd-gpt-auto-generator[3357]: EFI loader partition unknown, exiting.
[  244.982197] systemd-gpt-auto-generator[3357]: (The boot loader did not set EFI variable LoaderDevicePartUUID.)
[  246.224499] cs35l41 spi-VLV1776:01: DSP1: Legacy support not available
[  246.225712] cs35l41 spi-VLV1776:00: DSP1: Legacy support not available
RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1873379305

This looks identical to the issue I reported on the AMD freedesktop tracker, mentioned above. Have you applied the patches mentioned above? Is it still the same issue?

hrvylein commented 8 months ago

Had no crash for some days in a row and all of a sudden I have the crash again. Crash log looks familiar ...

Jan 02 21:35:19  kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=4775456, emitted seq=4775458
Jan 02 21:35:19  kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 
Jan 02 21:35:19  kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Jan 02 21:35:19  kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Jan 02 21:35:19  kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 02 21:35:19  kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Jan 02 21:35:19  kernel: [drm] PSP is resuming...
Jan 02 21:35:19  (udev-worker)[17488]: devcd1: Process 'cat /sys/devices/virtual/devcoredump/devcd1/data > /var/lib/steamos-log-submitter/pending/devcoredump/4903' failed with exit code 1.
Jan 02 21:35:19  kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Jan 02 21:35:20  kernel: [drm] DMUB hardware initialized: version=0x0300000A
Jan 02 21:35:20  kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Jan 02 21:35:20  kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jan 02 21:35:20  kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan 02 21:35:20  kernel: [drm] JPEG decode initialized successfully.
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Jan 02 21:35:20  kernel: [drm] Skip scheduling IBs!
Jan 02 21:35:20  kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(2) succeeded!

I would really like to test the kernel patches.

RodoMa92 commented 8 months ago

I just hit the same thing at random today while playing Terraria (low GPU usage, as expected). On my end, however, the GPU recovered fine, but gamescope died anyway. Log below:

[10229.343630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=34098, emitted seq=34100
[10229.344393] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[10229.344895] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[10229.436228] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[10229.446588] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[10229.447087] [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
[10229.447236] [drm] PSP is resuming...
[10229.469394] [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
[10230.155567] amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
[10230.155874] amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
[10230.165827] [drm] DMUB hardware initialized: version=0x0300000A
[10230.244640] [drm] Failed to add display topology, DTM TA is not initialized.
[10230.258284] [drm] kiq ring mec 2 pipe 1 q 0
[10230.260215] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[10230.260666] [drm] JPEG decode initialized successfully.
[10230.260672] amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[10230.260678] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[10230.260682] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[10230.260685] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[10230.260688] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[10230.260691] amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[10230.260694] amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[10230.260698] amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[10230.260701] amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[10230.260705] amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[10230.260708] amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[10230.260711] amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[10230.260715] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[10230.260718] amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[10230.260722] amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[10230.264045] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
[10230.264051] amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
[10230.264073] amdgpu 0000:04:00.0: amdgpu: GPU reset(1) succeeded!
[10230.264087] [drm] Skip scheduling IBs!

I'm running the latest kernel, 6.6.8 on Bazzite, so it's not SteamOS dependent.

unclejack commented 8 months ago

@RodoMa92: No, I haven't tested it yet.

Since Bazzite has kernel 6.6.8, at least one sdma related patch is included: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.8 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.6.8&id=3aae4ef4d799fb3d0381157640fdb251008cf0ae

This confirms that Mario's patch doesn't help at all with the issue we're running into.

The other patch (https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch) might help mask the issue without fixing the root cause. It leaves me with the impression that it's the equivalent of adding sleep() statements in some code to work around a race condition. I'm not saying it wouldn't be great to avoid the crashes. It's just that it feels wrong to not address the underlying root cause.

I'll grab the sources and build a kernel with that patch later. The thing is I won't even know whether this fixes the issue or if it simply didn't occur during the time I was testing the change.

The more recent reports on https://gitlab.freedesktop.org/drm/amd/-/issues/2220 with kernel 6.6.8 give the impression that AMD has quite some work to do to address this. The existing patches aren't likely to help much.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1875006252

Good catch, I didn't know that Mario's patch was already in 6.6.8. I've added the log and the report to AMD's issue report.

The second one literally forces a wait of 15s on the SDMA ring instead of 10s before forcefully killing the GPU driver and restarting it, so yeah, it's the definition of a workaround; but if it improves things without causing other issues, I would still try it.
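To make the 10s versus 15s point concrete, here is a minimal Python sketch. It is not the driver's actual logic: `job_times_out` and the sample durations are hypothetical, and only the two timeout values come from the description of the patch above.

```python
def job_times_out(job_duration_s: float, timeout_s: float) -> bool:
    """Return True if a job needing job_duration_s would be declared hung."""
    return job_duration_s > timeout_s


# Hypothetical job durations in seconds (made up for illustration).
# float("inf") models a job that never completes.
durations = [0.5, 12.0, float("inf")]

for duration in durations:
    # Compare the stock 10s timeout with the patched 15s timeout.
    print(duration, job_times_out(duration, 10.0), job_times_out(duration, 15.0))
```

Under these assumptions, the longer timeout only changes the outcome for a job that is slow but does eventually finish. A job that never completes still ends in a reset, which is why the patch can mask the problem without fixing it.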

unclejack commented 8 months ago

I've got the new kernel package built based on 6.1.52-valve12 with https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch applied on top. Let's see how it goes.

If the crash occurs again, it means that the patch doesn't help with the particular issue we run into.

hrvylein commented 8 months ago

@unclejack, would you mind sharing the kernel if your first tests seem promising? Does anyone know why there has been no response from a dev here?

RodoMa92 commented 8 months ago

> I've got the new kernel package built based on 6.1.52-valve12 with https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch applied on top. Let's see how it goes.
>
> If the crash occurs again, it means that the patch doesn't help with the particular issue we run into.

Are you sure that the first patch is already applied on top of Valve's kernel? I'm not sure it's already merged into 6.1 LTS or Valve's custom base.

nultiee commented 8 months ago

I'm getting a near identical stacktrace to @unclejack - gamescope core dumping, preceded by the fan control issues. See below.

I used to be getting the same experience as you, frozen image with the steam UI partially available before cutting to a black screen. However now it doesn't do this. It just reboots the device from scratch. The game instantly turns off and I'm presented with a "Verifying Installation" loading screen and then the Steam home screen.

```
Jan 03 23:49:41 steamdeck fancontrol.py[6093]: Warning: CPU temperature of 91.0 greater than max 90! Setting fan to max speed.
Jan 03 23:53:28 steamdeck fancontrol.py[6093]: Warning: CPU temperature of 93.0 greater than max 90! Setting fan to max speed.
Jan 03 23:53:29 steamdeck fancontrol.py[6093]: Warning: CPU temperature of 91.2 greater than max 90! Setting fan to max speed.
Jan 03 23:53:47 steamdeck kernel: perf: interrupt took too long (3142 > 3141), lowering kernel.perf_event_max_sample_rate to 63600
Jan 03 23:57:51 steamdeck kernel: [drm:gfx_v10_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Jan 03 23:57:51 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1007300, emitted seq=1007302
Jan 03 23:57:51 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process bg3_dx11.exe pid 7465 thread dxvk-submit pid 7506
Jan 03 23:57:51 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: Traceback (most recent call last):
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: controller.loop_control()
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: self.loop_read_sensors()
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: self.power_sensor.get_avg_value()
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: self.values.append(self.get_value())
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: ^^^^^^^^^^^^^^^^
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: self.value = int(f.read().strip()) / 1000000
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: ^^^^^^^^
Jan 03 23:57:51 steamdeck fancontrol.py[6093]: PermissionError: [Errno 1] Operation not permitted
Jan 03 23:57:51 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 23:57:52 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
Jan 03 23:57:52 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 03 23:57:52 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000).
Jan 03 23:57:52 steamdeck kernel: [drm] PSP is resuming...
Jan 03 23:57:52 steamdeck (udev-worker)[8137]: devcd2: Process 'cat /sys/devices/virtual/devcoredump/devcd2/data > /var/lib/steamos-log-submitter/pending/devcoredump/4718' failed with exit code 1.
Jan 03 23:57:52 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR
Jan 03 23:57:52 steamdeck fancontrol.py[8132]: loaded critical temp from SSD hwmon: 94.85
Jan 03 23:57:52 steamdeck fancontrol.py[8132]: returning fan to EC control loop
Jan 03 23:57:52 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'.
Jan 03 23:57:52 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 11.880s CPU time.
Jan 03 23:57:52 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming...
Jan 03 23:57:52 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully!
Jan 03 23:57:52 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A
Jan 03 23:57:53 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized.
Jan 03 23:57:53 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jan 03 23:57:53 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Jan 03 23:57:53 steamdeck kernel: [drm] JPEG decode initialized successfully.
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done
Jan 03 23:57:53 steamdeck kernel: [drm] Skip scheduling IBs!
Jan 03 23:57:53 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(4) succeeded!
Jan 03 23:57:53 steamdeck systemd[1]: jupiter-fan-control.service: Scheduled restart job, restart counter is at 2.
Jan 03 23:57:53 steamdeck systemd[1]: Stopped Jupiter fan control.
Jan 03 23:57:53 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 11.880s CPU time.
Jan 03 23:57:53 steamdeck systemd[1]: Started Jupiter fan control.
Jan 03 23:57:53 steamdeck fancontrol.py[8164]: loaded critical temp from SSD hwmon: 94.85
Jan 03 23:57:53 steamdeck fancontrol.py[8164]: jupiter-fan-control started successfully.
Jan 03 23:57:53 steamdeck systemd[1]: Started Process Core Dump (PID 8186/UID 0).
Jan 03 23:57:53 steamdeck core_handler[8187]: Minidump generated at /var/lib/steamos-log-submitter/pending/minidump/.staging-1704326273-gamescope-xwm-6184-None.dmp
Jan 03 23:57:53 steamdeck gpu-trace[563]: INFO - Executing get tracing status command
Jan 03 23:57:53 steamdeck gpu-trace[563]: 127.0.0.1 - - [03/Jan/2024 23:57:53] "POST / HTTP/1.1" 200 -
Jan 03 23:57:53 steamdeck gpu-trace[563]: INFO - Executing get tracing status command
Jan 03 23:57:53 steamdeck gpu-trace[563]: 127.0.0.1 - - [03/Jan/2024 23:57:53] "POST / HTTP/1.1" 200 -
Jan 03 23:57:53 steamdeck gpu-trace[563]: INFO - Executing get tracing status command
Jan 03 23:57:53 steamdeck gpu-trace[563]: 127.0.0.1 - - [03/Jan/2024 23:57:53] "POST / HTTP/1.1" 200 -
Jan 03 23:57:54 steamdeck systemd-coredump[8188]: Process 6184 (gamescope-wl) of user 1000 dumped core.

Stack trace of thread 6220:
#0 0x00007f0962e9f26c n/a (libc.so.6 + 0x8926c)
#1 0x00007f0962e4fa08 raise (libc.so.6 + 0x39a08)
#2 0x00007f0962e38538 abort (libc.so.6 + 0x22538)
#3 0x00007f0962e3845c n/a (libc.so.6 + 0x2245c)
#4 0x00007f0962e483d6 __assert_fail (libc.so.6 + 0x323d6)
#5 0x000055d42cffa647 n/a (gamescope + 0x7d647)
#6 0x000055d42d0bf33f n/a (gamescope + 0x14233f)
#7 0x000055d42cfd74fd n/a (gamescope + 0x5a4fd)
#8 0x000055d42cfeb501 n/a (gamescope + 0x6e501)
#9 0x000055d42cfecb7f n/a (gamescope + 0x6fb7f)
#10 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#11 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#12 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6184:
#0 0x00007f0962f13c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x000055d42cff35ff n/a (gamescope + 0x765ff)
#2 0x000055d42cf9f353 n/a (gamescope + 0x22353)
#3 0x00007f0962e39850 n/a (libc.so.6 + 0x23850)
#4 0x00007f0962e3990a __libc_start_main (libc.so.6 + 0x2390a)
#5 0x000055d42cfc12c5 n/a (gamescope + 0x442c5)

Stack trace of thread 6187:
#0 0x00007f0962e99f0e n/a (libc.so.6 + 0x83f0e)
#1 0x00007f0962e9c7a0 pthread_cond_wait (libc.so.6 + 0x867a0)
#2 0x00007f0961e5c41e n/a (libvulkan_radeon.so + 0x25c41e)
#3 0x00007f0961e3998c n/a (libvulkan_radeon.so + 0x23998c)
#4 0x00007f0961e5c34c n/a (libvulkan_radeon.so + 0x25c34c)
#5 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#6 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6218:
#0 0x00007f0962f21266 epoll_wait (libc.so.6 + 0x10b266)
#1 0x00007f0959df5579 n/a (libspa-support.so + 0x13579)
#2 0x00007f0959de8be3 n/a (libspa-support.so + 0x6be3)
#3 0x00007f096346c26f n/a (libpipewire-0.3.so.0 + 0x4126f)
#4 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6186:
#0 0x00007f0962f13c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x000055d42cff2a17 n/a (gamescope + 0x75a17)
#2 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6223:
#0 0x00007f0962e99f0e n/a (libc.so.6 + 0x83f0e)
#1 0x00007f0962e9c7a0 pthread_cond_wait (libc.so.6 + 0x867a0)
#2 0x00007f09630d9e11 __gthread_cond_wait (libstdc++.so.6 + 0xd9e11)
#3 0x000055d42cfcdee5 n/a (gamescope + 0x50ee5)
#4 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#5 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#6 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6188:
#0 0x00007f0962ee59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
#1 0x00007f0962eea5e7 __nanosleep (libc.so.6 + 0xd45e7)
#2 0x00007f0961d00455 n/a (libvulkan_radeon.so + 0x100455)
#3 0x00007f0961e5c34c n/a (libvulkan_radeon.so + 0x25c34c)
#4 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6219:
#0 0x00007f0962f13c0f __poll (libc.so.6 + 0xfdc0f)
#1 0x000055d42d017512 n/a (gamescope + 0x9a512)
#2 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6222:
#0 0x00007f0962ee59e5 clock_nanosleep (libc.so.6 + 0xcf9e5)
#1 0x00007f0962eea5e7 __nanosleep (libc.so.6 + 0xd45e7)
#2 0x000055d42cff303f n/a (gamescope + 0x7603f)
#3 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#4 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#5 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)

Stack trace of thread 6221:
#0 0x00007f0962f0f900 __open64 (libc.so.6 + 0xf9900)
#1 0x000055d42cfcd2d5 n/a (gamescope + 0x502d5)
#2 0x00007f09630e1943 execute_native_thread_routine (libstdc++.so.6 + 0xe1943)
#3 0x00007f0962e9d44b n/a (libc.so.6 + 0x8744b)
#4 0x00007f0962f20e40 n/a (libc.so.6 + 0x10ae40)
ELF object binary architecture: AMD x86-64
Jan 03 23:57:54 steamdeck gpu-trace[563]: INFO - Executing get tracing status command
Jan 03 23:57:54 steamdeck gpu-trace[563]: 127.0.0.1 - - [03/Jan/2024 23:57:54] "POST / HTTP/1.1" 200 -
Jan 03 23:57:54 steamdeck kernel: input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb1/1-3/1-3:1.2/0003:28DE:1205.000A/input/input35
Jan 03 23:57:54 steamdeck systemd[1]: systemd-coredump@1-8186-0.service: Deactivated successfully.
Jan 03 23:57:59 steamdeck systemd[1009]: gamescope-session.service: Consumed 4h 3min 28.541s CPU time.
```

gamescope minidumps: 1704326273-gamescope-xwm-6184-None.dmp 1704326684-gamescope-xwm-8283-None.dmp 1704323612-gamescope-xwm-1082-None.dmp

- Steam client version: 1702079146
- SteamOS version: 3.5.7
- Opted into Steam client beta?: No
- Opted into SteamOS beta?: No
- Have you checked for updates in Settings > System?: Yes

I'm very curious to see the outcome of your test, @unclejack.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1876859340

Nope, this is a proper driver crash from Baldur's Gate 3, unrelated to this. Feel free to open another issue for it.

unclejack commented 8 months ago

@hrvylein: Sharing the binary kernel package isn't something that bothers me in any way. However, there are several issues with doing so. First of all, it's not good practice to install things provided by random people on forums/bug trackers. It's not just because someone might put something malicious in those packages on purpose. It can also be that their system is infected with something. In the end, this is Valve's community issue tracker for SteamOS, and they may not be OK with that. Another potential problem is that this wasn't fully tested. If the built package has a bug due to the use of a slightly newer compiler, or some code is corrupted, data loss and other crashes are also possible.

Let's wait a bit until we hear from someone at Valve. They've likely been on a break during the holidays. This is a hobby for us. It's their job that they do on a daily basis.

Perhaps Valve will publish a SteamOS update once they get a chance to test the changes further.

@nultiee: When SteamOS shows the "Verifying installation" screen after a crash, it's just the UI restarting (with or without a GPU driver crash). You can easily check this in the logs. This is slightly better than a kernel crash which could indicate other kernel bugs or be caused by more serious hardware issues.
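As a rough illustration of that log check, here is a minimal Python sketch. The function name `summarize_crash` is hypothetical, and it assumes the journal has already been saved as plain text lines. It only looks for two messages that appear in the logs posted in this thread: amdgpu's `GPU reset begin!` and the `dumped core` line that systemd-coredump printed for gamescope.

```python
def summarize_crash(journal_lines):
    """Report which crash markers appear in an iterable of journal lines."""
    summary = {"gpu_reset": False, "gamescope_coredump": False}
    for line in journal_lines:
        # Message printed by amdgpu when it starts resetting the GPU.
        if "amdgpu: GPU reset begin!" in line:
            summary["gpu_reset"] = True
        # Message printed by systemd-coredump when gamescope aborts.
        if ("systemd-coredump" in line and "gamescope" in line
                and "dumped core" in line):
            summary["gamescope_coredump"] = True
    return summary


# Two lines copied from the log posted earlier in this thread.
sample = [
    "Jan 03 23:57:51 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!",
    "Jan 03 23:57:54 steamdeck systemd-coredump[8188]: Process 6184 (gamescope-wl) of user 1000 dumped core.",
]
print(summarize_crash(sample))
```

If both markers show up, it was a GPU reset that took gamescope down with it. If only the coredump marker shows up, the UI restarted without a driver reset.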

@RodoMa92: No, I've applied the sdma timeout patch on top of that kernel tree to build a package with the exact same version. The visible difference is that the kernel shows a different build date and that the files on disk have yesterday's timestamp, including the ramdisk and the kernel image.

This is the current status for the kernel trees:

  - kernel 6.6.8 - includes Mario's patch, doesn't include the sdma timeout patch, known to crash
  - kernel 6.1.52 on SteamOS - doesn't include Mario's patch, doesn't include the sdma timeout patch, known to crash
  - my custom kernel 6.1.52 on SteamOS - doesn't include Mario's patch, includes the sdma timeout patch, still potentially broken

I've tested the custom kernel which has the sdma timeout patch. This is just Valve's 6.1.52-valve12-1 kernel to which I've applied the https://gitlab.freedesktop.org/drm/amd/uploads/b77399cdff3f6e7206dba43527804978/0001-drm-amdgpu-adjust-SDMA-timeout.patch patch and built it.

I had a crash yesterday with this patched kernel. The interesting thing was that the game froze to a black screen. The driver didn't crash this time. The SteamOS UI was still accessible while the game was frozen on that black screen. The game never crashed. It just got stuck in that state. Without closing the game, I started another game. The other game started just fine. I closed the other game. This one was still stuck. I had no messages related to amdgpu in the dmesg output.

There are a few scenarios:

  1. This wasn't the bug which was previously leading to the amdgpu crash. It might be a game bug which leads to this crash.
  2. This was the same bug as before. The sdma timeout patch just avoids the crash for the rest of the GUI stack while the game ends up in the same broken state.
  3. It's some other bug.

I'd modify amdgpu's code further to do the following:

This would let us know when the new timeout has masked the issue and the application has ended up in a bad state due to this driver failure or when the application continued to work properly. Otherwise we have no clue what's going on.

nultiee commented 8 months ago

> ...3, unrelated to this. Feel free t...

Thanks, It's definitely not a Baldurs Gate 3 issue. It occurs on multiple games. E.g. here's another recent one when playing Strange Horticulture: ``` Jan 04 00:04:43 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1130503, emitted seq=1130505 Jan 04 00:04:43 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Strange Horticu pid 9472 thread dxvk-submit pid 9528 Jan 04 00:04:43 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin! Jan 04 00:04:43 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: MODE2 reset Jan 04 00:04:43 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume Jan 04 00:04:43 steamdeck kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F43FC00000). Jan 04 00:04:43 steamdeck kernel: [drm] PSP is resuming... Jan 04 00:04:43 steamdeck (udev-worker)[9577]: devcd3: Process 'cat /sys/devices/virtual/devcoredump/devcd3/data > /var/lib/steamos-log-submitter/pending/devcoredump/4747' failed with exit code 1. 
Jan 04 00:04:43 steamdeck kernel: [drm] reserve 0xa00000 from 0xf43e000000 for PSP TMR Jan 04 00:04:43 steamdeck fancontrol.py[8164]: Traceback (most recent call last): Jan 04 00:04:43 steamdeck fancontrol.py[8164]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 542, in Jan 04 00:04:43 steamdeck fancontrol.py[8164]: controller.loop_control() Jan 04 00:04:43 steamdeck fancontrol.py[8164]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 486, in loop_control Jan 04 00:04:43 steamdeck fancontrol.py[8164]: self.loop_read_sensors() Jan 04 00:04:43 steamdeck fancontrol.py[8164]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 452, in loop_read_sensors Jan 04 00:04:43 steamdeck fancontrol.py[8164]: self.power_sensor.get_avg_value() Jan 04 00:04:43 steamdeck fancontrol.py[8164]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 356, in get_avg_value Jan 04 00:04:43 steamdeck fancontrol.py[8164]: self.values.append(self.get_value()) Jan 04 00:04:43 steamdeck fancontrol.py[8164]: ^^^^^^^^^^^^^^^^ Jan 04 00:04:43 steamdeck fancontrol.py[8164]: File "/usr/share/jupiter-fan-control/fancontrol.py", line 351, in get_value Jan 04 00:04:43 steamdeck fancontrol.py[8164]: self.value = int(f.read().strip()) / 1000000 Jan 04 00:04:43 steamdeck fancontrol.py[8164]: ^^^^^^^^ Jan 04 00:04:43 steamdeck fancontrol.py[8164]: PermissionError: [Errno 1] Operation not permitted Jan 04 00:04:43 steamdeck systemd[1]: jupiter-fan-control.service: Main process exited, code=exited, status=1/FAILURE Jan 04 00:04:44 steamdeck fancontrol.py[9581]: loaded critical temp from SSD hwmon: 94.85 Jan 04 00:04:44 steamdeck fancontrol.py[9581]: returning fan to EC control loop Jan 04 00:04:44 steamdeck systemd[1]: jupiter-fan-control.service: Failed with result 'exit-code'. Jan 04 00:04:44 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 2.158s CPU time. Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resuming... 
Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: SMU is resumed successfully! Jan 04 00:04:44 steamdeck kernel: [drm] DMUB hardware initialized: version=0x0300000A Jan 04 00:04:44 steamdeck kernel: [drm] Failed to add display topology, DTM TA is not initialized. Jan 04 00:04:44 steamdeck kernel: [drm] kiq ring mec 2 pipe 1 q 0 Jan 04 00:04:44 steamdeck kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode). Jan 04 00:04:44 steamdeck kernel: [drm] JPEG decode initialized successfully. Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 
on hub 8 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8 Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow start Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: recover vram bo from shadow done Jan 04 00:04:44 steamdeck kernel: [drm] Skip scheduling IBs! Jan 04 00:04:44 steamdeck kernel: [drm] Skip scheduling IBs! Jan 04 00:04:44 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset(6) succeeded! Jan 04 00:04:44 steamdeck kernel: [drm] Skip scheduling IBs! Jan 04 00:04:45 steamdeck systemd[1]: jupiter-fan-control.service: Scheduled restart job, restart counter is at 3. Jan 04 00:04:45 steamdeck systemd[1]: Stopped Jupiter fan control. Jan 04 00:04:45 steamdeck systemd[1]: jupiter-fan-control.service: Consumed 2.158s CPU time. Jan 04 00:04:45 steamdeck systemd[1]: Started Jupiter fan control. Jan 04 00:04:45 steamdeck systemd[1]: Started Process Core Dump (PID 9623/UID 0). Jan 04 00:04:45 steamdeck fancontrol.py[9625]: loaded critical temp from SSD hwmon: 94.85 Jan 04 00:04:45 steamdeck fancontrol.py[9625]: jupiter-fan-control started successfully. Jan 04 00:04:45 steamdeck core_handler[9624]: Minidump generated at /var/lib/steamos-log-submitter/pending/minidump/.staging-1704326684-gamescope-xwm-8283-None.dmp Jan 04 00:04:45 steamdeck kernel: input: Steam Deck as /devices/pci0000:00/0000:00:08.1/0000:04:00.4/usb1/1-3/1-3:1.2/0003:28DE:1205.000A/input/input38 Jan 04 00:04:45 steamdeck systemd-coredump[9626]: Process 8283 (gamescope-wl) of user 1000 dumped core. Jan 04 00:04:46 steamdeck systemd[1]: systemd-coredump@2-9623-0.service: Deactivated successfully. ```

I'll raise a separate ticket if you think it's different, though. I'm also going to opt into Beta again to see whether I experience the same symptoms as @unclejack. I was definitely getting them on the Beta build, but that was months ago.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1877076254

In both cases the trace says the crashed ring is the gfx one, while everything @unclejack reported, and my own report over there, mentions only the sdma ring. So yes, I think it would be better to separate the two.

The specific issue described in the AMD bugzilla above shows up mostly when the GPU is under low load, so it should be far harder to hit while playing GPU-heavy games.

On that note, @unclejack, in which cases can you reproduce this? If it's mainly when idling or watching videos, it might just be the same issue as mine.
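
For anyone trying to work out which of the two cases their own logs fall into, here is a minimal sketch that counts timeouts per ring. It assumes the timeout lines look like the `amdgpu_job_timedout` lines posted in this thread; the names `RING_TIMEOUT`, `count_ring_timeouts` and `SAMPLE` are made up for illustration and are not part of any SteamOS tooling.

```python
import re
from collections import Counter

# Matches the timeout line format seen in the logs in this thread, e.g.
# "... *ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175"
RING_TIMEOUT = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def count_ring_timeouts(lines):
    """Return a Counter mapping ring name to the number of timeout lines."""
    counts = Counter()
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            counts[match.group("ring")] += 1
    return counts


# Two lines copied from the first log in this issue; only the first is a timeout.
SAMPLE = [
    "Dec 00 00:48:19 steamdeck kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring sdma0 timeout, signaled seq=35171, emitted seq=35175",
    "Dec 00 00:48:19 steamdeck kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!",
]

for ring, hits in count_ring_timeouts(SAMPLE).most_common():
    print(ring, hits)
```

To use it on a real log, save the journal output to a file and pass the open file object to `count_ring_timeouts`. If only `sdma0` shows up, it is probably this issue; a `gfx_0.0.0` entry would point at the other one.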

unclejack commented 8 months ago

@RodoMa92: I can reproduce the sdma0 crashes in game. It crashed once when loading. It's very likely not related to video decoding when I play the game. That one single crash on game startup could've been related to video decoding, or it could also be the same sequence of commands which leads to the GPU hang. I've run into these issues on SteamOS so far. I haven't tried Bazzite or anything else yet.

The more I think about the whole sdma timeout change, the more I think it's a hack which can't address the underlying issue. It also doesn't seem to help with games which get stuck on a black screen.

RodoMa92 commented 8 months ago

Replying to https://github.com/ValveSoftware/SteamOS/issues/1312#issuecomment-1877421710

That's fine, this still seems to be the same issue as mine. I mentioned video decoding since it's a lighter load on the APU, which means more time spent in GFXOFF and more chances for the firmware to do something wrong and crash the driver. Mario confirmed that disabling GFXOFF did fix the issue (but broke sleep as a side effect). That, plus the DMA part, makes me think of a weird race condition or contention on the shared memory on the GPU side.

The patch from Mario is a workaround while the firmware team investigates the crashes internally, so it seems it might not yet cover all the conditions needed to avoid them.

unclejack commented 8 months ago

The driver has crashed again with the patched kernel. The crash was identical to the previous ones. The workaround doesn't work.

That means there's no point in testing any kernel patches right now. AMD needs to fix the closed source binary blob that is the firmware, as @RodoMa92 has mentioned above.

RodoMa92 commented 8 months ago

> The driver has crashed again with the patched kernel. The crash was identical to the previous ones. The workaround doesn't work.
>
> That means there's no point in testing any kernel patches right now. AMD needs to fix the closed source binary blob that is the firmware, as @RodoMa92 has mentioned above.

Please leave this feedback for AMD on the bug tracker above, so they know that the hypothesis doesn't cover all cases. Thanks :)

unclejack commented 8 months ago

I've left feedback for AMD. The stability of their hardware is lacking right now. I'm not convinced they don't also have some serious hardware bugs.

We've put a lot of time and effort into debugging these crashes. Please chime in if you experience these sdma timeout crashes.

My experience with the Steam Deck has a lot of negative aspects. The fan can run at a high speed when running the exact same thing it ran when the fan was running a lot slower. The Steam Deck gets extremely hot on the lower side after about 1h30. The wifi has weird stability issues. Power management and the firmware seem iffy (what's up with ASPM, CPPC and the firmware bugs?). A different Steam Deck unit takes about 1-2 minutes to boot on the stable OS. I'm now wondering if these SoCs have some defects which cause some of these issues.

Either way, you probably don't want to waste time with these patches. The Steam Deck still crashes with a patched kernel.

emcy849 commented 8 months ago

> I'm now wondering if these SoCs have some defects which cause some of these issues.

I've wondered this too. It doesn't look great if that's the case, but hopefully it could be mitigated in software. There are two points against this hypothesis, though. The first is that people are still getting this crash on the OLED, which is almost a complete internal hardware redesign, including a new SoC on a smaller node. You might expect Valve (or AMD as their silicon partner) to have caught and fixed silicon bugs in that case. The other is that the Steam Deck SoC doesn't appear to be entirely bespoke: apparently it's most likely a custom SoC shared with a product called Magic Leap 2, and possibly others. I recently saw a silicon analysis video on YouTube that indicated this, including a large chunk of disabled silicon in the Deck's APU that is most likely disabled because it's specific to the other product and has nothing to do with Valve. You might expect AMD to have tested the SoC well enough to avoid such a nasty bug if they're selling it to multiple customers.

So anyway, that would potentially be two points against a hardware bug hypothesis.

I'm also still at a loss as to why I never experienced this crashing problem before updating to OS 3.5 and BIOS 118, or whichever one it came with.

Rrauros commented 8 months ago

> I'm now wondering if these SoCs have some defects which cause some of these issues.

I don't think there are any hardware defects in this case. It's probably a software issue. I have a 7900 XT and had these hangs on the desktop, with sdma timeouts and ring gfx_0.0.0 timeouts. After switching to Windows, the hangs are gone. If someone can install Windows on a Steam Deck to see whether the issues are resolved, we can confirm this.

RodoMa92 commented 8 months ago

I still can't hit this reliably, so I highly doubt that this specific issue is hardware. I only managed to hit it once, at random, when I posted the log above, and never again after hundreds of hours of playing a mix of AAA and 2D indie games. Besides, this seems to affect the whole RDNA2 architecture, including dedicated GPUs, and I highly doubt they sold all of that hardware broken.

I seriously hope that AMD wakes the heck up and begins fixing their stuff. At least with them a ton of stuff is fixable externally, unlike NVIDIA (not that these issues are an excuse to force the community to do their work).

However, I share a lot of the other points @unclejack made about firmware bugs (I have encountered a TON of firmware bugs and regressions), with no way to communicate with their department about status, remediations, or workarounds. I still don't know whether Valve actually has firmware engineers or whether the Deck's current firmware has been entirely outsourced, but given the sheer number of bugs in it, it would not surprise me if it has.

unclejack commented 8 months ago

@Rrauros: Thanks for letting us know you also run into this with a 7900 XT. We now know this bug affects RDNA (1), RDNA2 and RDNA3. The problem is common to all of them, regardless of what it is. Could you also post on the tracker at https://gitlab.freedesktop.org/drm/amd/-/issues/2220 so people know you run into these issues with a 7900 XT, please?

@emcy849: Based on the fact that this is common for all of the RDNA families, it's indeed unlikely that the Steam Deck's iGPU has a bug specific to it. There's not much point in installing Windows on the Steam Deck. We have confirmation from others who use these GPUs on Windows already. @Rrauros confirmed above that his RDNA3 7900 XT doesn't run into issues on Windows.

It's still likely that the Steam Deck's APU has other issues with CPPC, ASPM and other things. Given all the bugs we've run into, it's not that good. The thing is the Steam Deck is a bit more than yet another random amd64 machine sold on the PC market. It's meant to be something different. It has been something different for many in some ways. That's why it needs to be better.

@RodoMa92: You're right as far as I'm concerned. They do need to take this seriously. My impression so far is that they haven't done that yet.

If you run into these issues with a different configuration, please post on that tracker to make sure AMD pays attention.

emcy849 commented 8 months ago

> Based on the fact that this is common for all of the RDNA families

Wow, OK. Well, that sucks mightily. I suppose the question then is whether such a bug exists in every unit, or whether gambling on an RMA is actually worth it.

FWIW, I had a 100% repro on this bug until recently. It was an LCD Deck on OS 3.5.5/3.5.7, BIOS 119, trying to load Halo Infinite's campaign mode. You start the game, then switch over to campaign mode, which seems to be a separate exe. It crashed every single time without fail, and I tried it dozens of times. Different Proton versions had no effect. This changed when I went to BIOS 120: now it mostly loads, but the game still sometimes crashes the same way. I also had a 100% repro on a non-Steam game called 'Sonic CD Restored': it would crash every time like clockwork at the exact same place in the intro sequence, just after the logos. That was also alleviated somewhat by BIOS 120.