Kreyren / kreyren

Personal tracking for issues that I need to resolve, to be used as a reference for someone else and/or for peer review of the solution
GNU General Public License v3.0

Develop a method to perform sanity-check of VRAM with identification for failed banks #92

Open Kreyren opened 3 years ago

Kreyren commented 3 years ago

Expectation

A method to test Video Random Access Memory (VRAM) for failed banks, to diagnose https://github.com/Kreyren/kreyren/issues/84 and https://github.com/Kreyren/kreyren/issues/87, so that the bank that has to be replaced can be identified without resorting to the crude method[2].

Ideal

Software written in Rust that doesn't depend on an operating system to perform the VRAM sanity check, e.g. run from a bootable device.

The result should be saved on the bootable device so that the malfunctioning VRAM can be identified, similar to how memtest86 works.

Relevants

  1. Method highlighted by MV TechLab (YouTube) utilizing a Python script https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py
  2. If the system halts during the testing -> Lower the clocks[1.1]

References

  1. Reference to diagnose bad VRAM by MV TechLab (YouTube) using a python script https://youtu.be/UVi_UAc6L6M
    1.1. Highlights issues with halting https://youtu.be/UVi_UAc6L6M?t=362
    1.2. Repo available at https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py
  2. Crude way to diagnose a bad VRAM bank by MV TechLab (YouTube) https://www.youtube.com/watch?v=6LOp4IMulEk
Kreyren commented 2 years ago

CC @galkinvv My team and I are trying to make a memtest for VRAM written in Rust. Could you clarify how your Python script tests the GPU memory banks?

It seems to be using /dev/mem, but afaik that is for CPU RAM only? (I wasn't able to get helpful info from driver devs)

galkinvv commented 2 years ago

My Python script is very simple. Each PCIe device has a memory BAR that is mapped into a part of the CPU's physical address space. For GPUs it is a memory segment of at least 256MB that can be seen in the lspci -v output:

    03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd GM200 [GeForce GTX 980 Ti]
        Flags: bus master, fast devsel, latency 0, IRQ 63
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities:
        Kernel driver in use: nvidia
        Kernel modules: nvidia
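For anyone following along, here is a rough sketch of pulling the memory regions out of that kind of lspci -v text. It is not taken from direct-mem-test.py. The names REGION_RE, parse_memory_regions and largest_prefetchable are made up for illustration, and picking the largest prefetchable region as the VRAM window is only an assumption that happens to match the output above.

```python
import re

# Matches region lines such as:
#   Memory at b0000000 (64-bit, prefetchable) [size=256M]
# Optional bracketed flags such as [disabled] before [size=...] are skipped.
REGION_RE = re.compile(
    r"Memory at ([0-9a-f]+) \((32|64)-bit, (non-)?prefetchable\)"
    r"(?: \[\w+\])* \[size=(\d+)([KMG]?)\]"
)
UNIT = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memory_regions(lspci_text):
    """Return a list of (base_address, size_bytes, prefetchable) tuples."""
    regions = []
    for m in REGION_RE.finditer(lspci_text):
        base = int(m.group(1), 16)
        size = int(m.group(4)) * UNIT[m.group(5)]
        prefetchable = m.group(3) is None
        regions.append((base, size, prefetchable))
    return regions


def largest_prefetchable(regions):
    """Pick the biggest prefetchable region (assumed to be the VRAM window)."""
    candidates = [r for r in regions if r[2]]
    return max(candidates, key=lambda r: r[1]) if candidates else None


SAMPLE = """\
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=32M]
"""

if __name__ == "__main__":
    base, size, _ = largest_prefetchable(parse_memory_regions(SAMPLE))
    print(hex(base), size)
```

The base address and size found this way are what a /dev/mem based test would need as its mapping offset and length.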

So just opening /dev/mem and writing to and then reading back that segment allows testing GPU memory... but it needs two preconditions:
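Leaving the preconditions aside, the write-then-read idea itself can be sketched roughly as follows. This is a minimal illustration and not the code of direct-mem-test.py: fill, verify and map_bar are hypothetical names, the single fixed pattern is an assumption, and map_bar needs root on a kernel that permits /dev/mem access to that range.

```python
import mmap
import os
import struct

WORD = struct.Struct("<I")  # one 32-bit little-endian word


def fill(buf, length, pattern):
    """Write `pattern` into every whole 32-bit word of the first `length` bytes."""
    for off in range(0, length - length % 4, 4):
        WORD.pack_into(buf, off, pattern)


def verify(buf, length, pattern):
    """Read every word back and return the byte offsets that do not match."""
    bad = []
    for off in range(0, length - length % 4, 4):
        (value,) = WORD.unpack_from(buf, off)
        if value != pattern:
            bad.append(off)
    return bad


def map_bar(base, length, path="/dev/mem"):
    """Map `length` bytes of physical address space starting at `base`.

    Assumption: `base` is the BAR address reported by lspci; requires root.
    """
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        return mmap.mmap(fd, length, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE, offset=base)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


if __name__ == "__main__":
    # Dry run on ordinary anonymous memory, so nothing touches real hardware.
    scratch = mmap.mmap(-1, 4096)
    fill(scratch, 4096, 0xA5A5A5A5)
    print("bad offsets:", verify(scratch, 4096, 0xA5A5A5A5))
```

On real hardware the buffer would come from map_bar(base, size) instead of the anonymous mapping, and a failing offset plus the BAR base gives the address to report.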

galkinvv commented 2 years ago

And about the language choice. Testing video memory without a GPU driver loaded is, most of the time, as simple as "iterate over all memory and count errors". So note that there are no great performance-related bottlenecks here. Using Python keeps the tool open source and simplifies local modifications by end users (unfortunately my https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py became an unreadable mess, but the language is not the problem)

If you want to practice Rust, there is another GPU memory testing utility where performance is more important and the code doesn't fit on one screen. It tests memory for cases where the driver installs and the memory fails only at high clocks. It is memtestCL (https://github.com/galkinvv/memtestCL). The main problem with it is that it outputs only an aggregated count of errors and nothing about their addresses. Aggregation is done via OpenCL, but something like "the smallest error address" and "the biggest error address" should be possible to add. However, this requires OpenCL knowledge, which I lack for now :(
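To make the "smallest and biggest error address" idea concrete, here is a tiny sketch of the aggregation on the host side. It is plain Python and not memtestCL or OpenCL code; summarize_errors is a hypothetical name, and the input is assumed to be any iterable of failing addresses.

```python
def summarize_errors(error_addresses):
    """Reduce failing addresses to (count, lowest, highest).

    Only three values are kept, so the memory cost stays constant no matter
    how many errors are found.
    """
    count = 0
    lowest = None
    highest = None
    for addr in error_addresses:
        count += 1
        lowest = addr if lowest is None else min(lowest, addr)
        highest = addr if highest is None else max(highest, addr)
    return count, lowest, highest


if __name__ == "__main__":
    count, lowest, highest = summarize_errors([0x300, 0x100, 0x200])
    print(count, hex(lowest), hex(highest))
```

The range between the lowest and highest failing address already narrows down which part of the memory is affected, even without a full error list.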

galkinvv commented 2 years ago

And a 3rd precondition for testing memory: all drivers that may be talking to the card (amdgpu, nouveau, nvidia, maybe efifb) should be unloaded (rmmod) or inactive (nomodeset on the kernel command line). Having a memory test and a driver use the same PCIe BAR to talk to the card will lead to bugs, hangs, crashes, or incorrect results.
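A small pre-flight check for this precondition could look roughly like the sketch below. It assumes a Linux system where /proc/modules lists loaded modules with the module name as the first field and /proc/cmdline holds the kernel command line. The names GPU_DRIVERS, loaded_gpu_drivers and has_nomodeset are made up here, and efifb is left out because it is often built into the kernel and then not listed as a module.

```python
GPU_DRIVERS = {"amdgpu", "nouveau", "nvidia"}


def loaded_gpu_drivers(proc_modules_text, drivers=GPU_DRIVERS):
    """Return the subset of `drivers` that appears in /proc/modules text."""
    loaded = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] in drivers:
            loaded.add(fields[0])
    return loaded


def has_nomodeset(cmdline_text):
    """True if the kernel command line contains the nomodeset option."""
    return "nomodeset" in cmdline_text.split()


if __name__ == "__main__":
    with open("/proc/modules") as f:
        busy = loaded_gpu_drivers(f.read())
    with open("/proc/cmdline") as f:
        quiet_kms = has_nomodeset(f.read())
    if busy:
        print("unload first (rmmod):", ", ".join(sorted(busy)))
    else:
        print("no listed GPU driver loaded; nomodeset:", quiet_kms)
```

This only checks the exact module names given; helper modules that a driver loads alongside itself would need to be added to the set.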

The only semi-compatible driver is vesa/vga, which only uses the first half of the 256MB. So while vesa display is active, the upper half of the PCIe BAR may still be tested.
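As a worked example of that split, the range to test while vesa is active can be computed like this. upper_half is a hypothetical helper, and the base address and size are simply the ones from the lspci output earlier in the thread.

```python
def upper_half(base, size):
    """Return (start_address, length) of the upper half of a BAR."""
    half = size // 2
    return base + half, size - half


if __name__ == "__main__":
    start, length = upper_half(0xB0000000, 256 << 20)
    print(hex(start), length >> 20, "MB")
```

For the 256MB BAR at b0000000 this gives a 128MB range starting at b8000000, which is the part the vesa/vga driver is said not to use.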

galkinvv commented 2 years ago

BTW, from a practical repairer's point of view, the "crude" method from the video linked above is the fastest and most reliable.

Its only disadvantage is that it can't be used if no picture is visible or there are no visible artifacts in the picture.

Kreyren commented 2 years ago

Thanks for all the info, it's very appreciated and is a major help in developing this! ^-^

We are still processing the info