GpuZelenograd / memtest_vulkan

Vulkan compute tool for testing video memory stability
https://github.com/GpuZelenograd/memtest_vulkan/blob/main/Readme.md
zlib License

rtx 3060 gets checkered screen freeze after short time #25

Closed by cognitivetech 6 months ago

cognitivetech commented 7 months ago

The log output shows nothing unusual:

Tester console logging started at 2023-12-06T19:26:41.868859Z

1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
2: Bus=0x00:00 DevId=0x0000   2GB llvmpipe (LLVM 12.0.0, 256 bits)
Tester worker logging started at 2023-12-06T19:26:47.192084Z
Standard 5-minute test of 1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
      1 iteration. Passed  0.0592 seconds  written:    7.2GB 311.8GB/sec        checked:   10.9GB 302.8GB/sec
     18 iteration. Passed  1.0195 seconds  written:  123.2GB 306.1GB/sec        checked:  184.9GB 299.7GB/sec
    101 iteration. Passed  5.0456 seconds  written:  601.8GB 301.9GB/sec        checked:  902.6GB 295.7GB/sec
    598 iteration. Passed 30.0469 seconds  written: 3603.2GB 303.1GB/sec        checked: 5404.9GB 297.7GB/sec
   1094 iteration. Passed 30.0452 seconds  written: 3596.0GB 302.9GB/sec        checked: 5394.0GB 296.8GB/sec
Tester console logging started at 2023-12-06T19:44:45.371219Z

1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
2: Bus=0x00:00 DevId=0x0000   2GB llvmpipe (LLVM 12.0.0, 256 bits)
Tester worker logging started at 2023-12-06T19:44:48.721239Z
Standard 5-minute test of 1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
      1 iteration. Passed  0.0573 seconds  written:    7.2GB 314.5GB/sec        checked:   10.9GB 318.0GB/sec
     18 iteration. Passed  1.0042 seconds  written:  123.2GB 304.9GB/sec        checked:  184.9GB 308.1GB/sec
    102 iteration. Passed  5.0157 seconds  written:  609.0GB 301.6GB/sec        checked:  913.5GB 304.9GB/sec
    605 iteration. Passed 30.0524 seconds  written: 3646.8GB 301.5GB/sec        checked: 5470.1GB 304.6GB/sec
   1107 iteration. Passed 30.0507 seconds  written: 3639.5GB 300.9GB/sec        checked: 5459.2GB 304.0GB/sec
Tester console logging started at 2023-12-06T21:21:09.267532Z

1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
2: Bus=0x00:00 DevId=0x0000   2GB llvmpipe (LLVM 12.0.0, 256 bits)
Tester worker logging started at 2023-12-06T21:21:15.257429Z
Standard 5-minute test of 1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
      1 iteration. Passed  0.0577 seconds  written:    7.2GB 314.7GB/sec        checked:   10.9GB 313.9GB/sec
     19 iteration. Passed  1.0367 seconds  written:  130.5GB 314.4GB/sec        checked:  195.8GB 314.9GB/sec
    104 iteration. Passed  5.0253 seconds  written:  616.2GB 306.4GB/sec        checked:  924.4GB 306.7GB/sec
Tester console logging started at 2023-12-07T00:41:09.628432Z

1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
2: Bus=0x00:00 DevId=0x0000   2GB llvmpipe (LLVM 12.0.0, 256 bits)
Tester worker logging started at 2023-12-07T00:41:13.233889Z
Standard 5-minute test of 1: Bus=0x07:00 DevId=0x2504   12GB NVIDIA GeForce RTX 3060
      1 iteration. Passed  0.0553 seconds  written:    7.0GB 315.1GB/sec        checked:   10.5GB 317.0GB/sec
     19 iteration. Passed  1.0207 seconds  written:  126.0GB 307.0GB/sec        checked:  189.0GB 309.7GB/sec
    107 iteration. Passed  5.0279 seconds  written:  616.0GB 304.6GB/sec        checked:  924.0GB 307.4GB/sec
    630 iteration. Passed 30.0137 seconds  written: 3661.0GB 302.7GB/sec        checked: 5491.5GB 306.4GB/sec

I must reboot to restore function. Mostly this card works fine, but when running this test, and occasionally under heavy load, I get this problem and can't diagnose it.

galkinvv commented 6 months ago

memtest_vulkan tries to use the simplest GPU commands to avoid the situation where GPU problems lead to a hang before any errors are logged, but this is not always possible. Sometimes errors appear "atomically": everything is working, then the system hangs completely.

A checkerboard pattern during a hang almost always means hardware problems. Sometimes these can be solved by underclocking the GPU and memory: start with an extreme underclock of both GPU and memory to find out whether it helps at all; if it does, then find the maximum stable clocks. Adding the line Option "Coolbits" "28" to xorg.conf enables underclocking the RTX 30x0 via the nvidia-settings GUI.
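For illustration, the Coolbits line usually goes inside the Device section that describes the NVIDIA card. This is a minimal sketch, assuming such a section already exists in your xorg.conf; the Identifier value "Device0" is a placeholder and should stay whatever your file already uses.

```
Section "Device"
    # "Device0" is a placeholder; keep the Identifier your xorg.conf already has
    Identifier "Device0"
    Driver     "nvidia"
    # Coolbits value suggested in this thread for RTX 30x0 clock controls
    Option     "Coolbits" "28"
EndSection
```

The X server has to be restarted before the new option takes effect.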

Underclocking this way allows reaching the lowest clocks for testing purposes, but it is not compatible with Wayland.

I have had no experience or success with the other methods mentioned in the Arch wiki; maybe some of them work fine.

I'll convert this to a card-specific discussion, since this is not a memtest_vulkan problem.