Open tsg2k2 opened 1 year ago
In short: this is a known problem that is not easy to fix without sacrificing other goals.
Here is the more detailed description: initially the tool was targeted at testing repaired/used hardware for correctness, and such hardware tends to have obscure problems that arise only under special load patterns, not under "just full bandwidth utilization". So memtest_vulkan tries hard to achieve a memory access pattern that tests not only the data transfer between the GPU and memory, but also makes the accessed memory addresses switch in a fairly random and unpredictable way. Reading completely random addresses, however, would be very slow, so the testing is done in a balanced mode: semi-random reading of small consecutive blocks. I didn't check with a profiler, but it seems that calculating those addresses actually consumes a lot of GPU compute power, so the tool becomes slightly compute-bound and fails to utilize all of the theoretical bandwidth.
As a result, it performs testing with intensive switching of accessed addresses and carefully checks every value read back, but it doesn't achieve maximal heating and power draw.
I have some ideas for utilizing more bandwidth while keeping complex access patterns, but for now they are only at the planning stage.
Thanks for the detailed explanation. I have a 3090 which fails under heavy load with the event-log entry "nvlddmkm event 0".
My current suspicion is bad VRAM, but your tool detects nothing; I assume this is because it doesn't stress the VRAM enough.
On Sat, Jan 14, 2023, 2:13 PM, Vasily Galkin wrote the reply quoted above.
The tool only reaches ~75-80% of the theoretical bandwidth on an RTX 3090. That doesn't seem to be enough to stress-test the memory.