Kreyren commented 3 years ago

Expectation

A method for testing Video Random Access Memory (VRAM) for failed banks, to diagnose https://github.com/Kreyren/kreyren/issues/84 and https://github.com/Kreyren/kreyren/issues/87, so that we know which bank has to be replaced without resorting to the crude method[2].

Ideal

Software written in Rust that doesn't depend on an operating system to perform the VRAM sanity check, e.g. run from a bootable device.

The result should be saved on the bootable device so that malfunctioning VRAM can be identified, similar to how memtest86 works.

Relevants

Method highlighted by MV TechLab (YouTube) utilizing a Python script https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py
If the system halts during testing -> lower the clocks[1.1]

References

1. Reference for diagnosing bad VRAM by MV TechLab (YouTube) using a Python script https://youtu.be/UVi_UAc6L6M
   1.1. Highlights issues with halting https://youtu.be/UVi_UAc6L6M?t=362
   1.2. Repo available at https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py
2. Crude way to diagnose a bad VRAM bank by MV TechLab (YouTube) https://www.youtube.com/watch?v=6LOp4IMulEk

Kreyren commented 2 years ago

CC @galkinvv My team and I are trying to make a memtest for VRAM written in Rust. Could you clarify how your Python script tests the GPU memory banks?

It seems to use /dev/mem, but afaik that is for CPU RAM only? (I wasn't able to get helpful info from driver devs)

galkinvv commented 2 years ago

My Python script is very simple. Each PCIe device has a memory BAR that corresponds to a part of the CPU's physical address space. For GPUs this is a memory segment of at least 256MB, which can be seen in the lspci -v output:

03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Gigabyte Technology Co., Ltd GM200 [GeForce GTX 980 Ti]
    Flags: bus master, fast devsel, latency 0, IRQ 63
    Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
    Memory at b0000000 (64-bit, prefetchable) [size=256M]
    Memory at c0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities:
    Kernel driver in use: nvidia
    Kernel modules: nvidia
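For anyone scripting this, the `Memory at ...` entries in the lspci output above can be pulled out with a small parser. This is only a sketch: the function name `parse_memory_bars` and the returned dict layout are made up for illustration, and the regex only covers the line shape shown in this thread, not every variant lspci can print.

```python
import re

# Matches entries such as: "Memory at b0000000 (64-bit, prefetchable) [size=256M]"
BAR_RE = re.compile(
    r"Memory at ([0-9a-f]+) \((\d+)-bit, (non-)?prefetchable\) \[size=(\d+)([KMG]?)\]"
)
UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memory_bars(lspci_text):
    """Return one dict per memory BAR found in `lspci -v` text (hypothetical helper)."""
    bars = []
    for m in BAR_RE.finditer(lspci_text):
        bars.append({
            "start": int(m.group(1), 16),                   # physical start address
            "size": int(m.group(4)) * UNITS[m.group(5)],    # size in bytes
            "prefetchable": m.group(3) is None,             # no "non-" prefix
        })
    return bars


sample = (
    "Memory at fa000000 (32-bit, non-prefetchable) [size=16M]\n"
    "Memory at b0000000 (64-bit, prefetchable) [size=256M]\n"
    "Memory at c0000000 (64-bit, prefetchable) [size=32M]\n"
)
for bar in parse_memory_bars(sample):
    print(hex(bar["start"]), bar["size"], bar["prefetchable"])
```

In the example above, the 256M prefetchable region at b0000000 is the segment the rest of this discussion refers to.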

So just opening /dev/mem and writing to and reading from that segment allows testing GPU memory... but it needs two preconditions:

(very simple) The PCIe BAR must be enabled: setpci -s 00:03:00.0 COMMAND=0x02
(very complex) The GPU's ASIC init procedure must progress at least up to memory training and the inside-GPU mapping of physical GPU memory to logical addresses accessible via the PCIe BAR.
- If this is not done, NVIDIA GPUs report byte patterns indicating a bad access: bad0ac00 (the last byte, 00 here, is a changing counter)
- ASIC init can be executed by either
  - UEFI/BIOS during system POST (primary GPU)
  - driver loading (secondary GPU)
- But with broken memory, ASIC init can hang the system, bail out before performing the memory mapping, or fail in 1000 other ways. Hundreds of hours spent on this problem didn't yield any serious progress(

galkinvv commented 2 years ago

And about the language choice. Video memory testing without a loaded GPU driver is most of the time as simple as "iterate over all memory and count errors". So note that there are no major performance bottlenecks here. And using Python ensures the code stays open source and simplifies local modifications by end users (unfortunately my https://github.com/galkinvv/galkinvv.github.io/blob/master/direct-mem-test.py became an unreadable mess, but the language is not the problem)

If you want to practice Rust, there is another GPU memory testing utility where performance is more important and the code doesn't fit on one screen. It tests memory for cases where the driver installs and memory fails only at high clocks. It is https://github.com/galkinvv/memtestCL The main problem with it: it outputs only an aggregated count of errors and nothing about their addresses. Aggregation is done via OpenCL, but something like "the smallest error address" and "the biggest error address" should be possible to add. However, this requires OpenCL knowledge, which I lack for now(
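To make the "smallest / biggest error address" idea concrete, here is what that aggregation amounts to on the host side. This is plain Python for illustration only, not memtestCL code and not OpenCL; `ErrorSummary` is a hypothetical name, and a real kernel would presumably need atomic updates to do the same thing across work items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorSummary:
    """Running aggregate: error count plus lowest and highest failing address."""
    count: int = 0
    lowest: Optional[int] = None
    highest: Optional[int] = None

    def record(self, address: int) -> None:
        """Fold one failing address into the aggregate."""
        self.count += 1
        self.lowest = address if self.lowest is None else min(self.lowest, address)
        self.highest = address if self.highest is None else max(self.highest, address)


summary = ErrorSummary()
for failing_address in (0x300, 0x100, 0x200):  # made-up addresses
    summary.record(failing_address)
print(summary.count, hex(summary.lowest), hex(summary.highest))
```

Even this coarse range would narrow down which part of the address space (and so possibly which bank) is affected, at the cost of only two extra values per run.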

galkinvv commented 2 years ago

And a 3rd precondition for testing memory. All drivers that could possibly talk to the card (amdgpu, nouveau, nvidia, maybe efifb) should be unloaded (rmmod) or inactive (nomodeset on the kernel command line). Having a memory test and a driver use the same PCIe BAR to talk to the card will lead to bugs/hangs/crashes/incorrect results.
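A test tool could refuse to start while one of these modules is loaded. The sketch below checks the text of /proc/modules, where the first field of each line is the module name. The function names are hypothetical, and the check is incomplete by design: a driver built into the kernel (efifb often is) would not show up there.

```python
# Module names taken from the comment above; extend as needed.
CONFLICTING = frozenset({"amdgpu", "nouveau", "nvidia"})


def loaded_conflicting_modules(proc_modules_text, names=CONFLICTING):
    """Return the sorted names from `names` that appear as loaded modules."""
    loaded = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] in names:
            loaded.add(fields[0])
    return sorted(loaded)


def check_system(path="/proc/modules"):
    """Read the running kernel's module list and report conflicts."""
    with open(path) as f:
        return loaded_conflicting_modules(f.read())


example = "nouveau 2000000 1 - Live 0x0000000000000000\nsnd 100000 0 - Live 0x0000000000000000\n"
print(loaded_conflicting_modules(example))
```

If `check_system()` returns a non-empty list, the listed modules would need to be removed with rmmod (or kept from loading) before touching the BAR.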

The only semi-compatible one is the vesa/vga driver, which only uses the first half of the 256MB. So while vesa display is active, the upper half of the PCIe BAR may be tested.
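As a worked example of that split, using the 256M BAR at b0000000 from the lspci output earlier in the thread (`upper_half` is a hypothetical helper, and the assumption that the split falls exactly at the midpoint comes from the comment above):

```python
def upper_half(bar_start, bar_size):
    """Return (start, length) of the upper half of a BAR."""
    half = bar_size // 2
    return bar_start + half, half


start, length = upper_half(0xB0000000, 256 * 1024 * 1024)
print(hex(start), length // (1024 * 1024), "MiB")
```

So with vesa active, a test would cover the 128 MiB starting at 0xb8000000 and leave the lower half alone.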

galkinvv commented 2 years ago

BTW, from a practical repairer's point of view, the "Crude" method from the video linked above is the fastest and most reliable.

Its only disadvantage: it can't be used if no picture is visible or there are no visible artifacts in the picture.

Kreyren commented 2 years ago

Thanks for all the info, it's very appreciated and is a major help in developing this! ^-^

We are still processing the info

Kreyren / kreyren

Develop a method to perform sanity-check of VRAM with identification for failed banks #92
