Tool doesn't utilize bandwidth fully

tsg2k2 commented 1 year ago

The tool only reaches ~75-80% of the theoretical bandwidth on an RTX 3090. That doesn't seem to be enough to stress-test the memory.

   6547 iteration. Passed 30.0083 seconds  written:10766.2GB 758.7GB/sec        checked:12919.5GB 816.7GB/sec
   7140 iteration. Passed 30.0201 seconds  written:10748.1GB 756.2GB/sec        checked:12897.8GB 816.0GB/sec
   7733 iteration. Passed 30.0018 seconds  written:10748.1GB 756.2GB/sec        checked:12897.8GB 816.9GB/sec
   8327 iteration. Passed 30.0347 seconds  written:10766.2GB 756.8GB/sec        checked:12919.5GB 817.2GB/sec
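A rough sketch of how the figures in these log lines relate to each other, for anyone who wants to recompute the utilization. It assumes a peak of 936 GB/s for the RTX 3090 (384-bit bus at 19.5 Gbit/s per pin); that number is not from this thread. The function name `summarize` and the regular expression are illustrative only and are not part of memtest_vulkan. The per-phase rates appear to be measured over each phase's own time, because written/rate plus checked/rate adds up to the reported elapsed seconds.

```python
import re

# Assumption: RTX 3090 peak = 384-bit bus * 19.5 Gbit/s per pin / 8 bits per byte.
ASSUMED_PEAK_GB_PER_SEC = 384 * 19.5 / 8  # 936.0

# Matches the log format shown above; written for this thread, not taken from the tool.
LINE_RE = re.compile(
    r"(?P<iteration>\d+) iteration\. Passed (?P<seconds>[\d.]+) seconds\s+"
    r"written:(?P<written_gb>[\d.]+)GB\s+(?P<write_rate>[\d.]+)GB/sec\s+"
    r"checked:(?P<checked_gb>[\d.]+)GB\s+(?P<check_rate>[\d.]+)GB/sec"
)


def summarize(line, peak=ASSUMED_PEAK_GB_PER_SEC):
    """Return phase durations and bandwidth utilization for one log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("line does not look like a memtest_vulkan status line")
    f = {name: float(value) for name, value in match.groupdict().items()}
    return {
        "elapsed_seconds": f["seconds"],
        # Time each phase must have taken if its rate covers only that phase.
        "write_seconds": f["written_gb"] / f["write_rate"],
        "check_seconds": f["checked_gb"] / f["check_rate"],
        # Fraction of the assumed theoretical peak.
        "write_utilization": f["write_rate"] / peak,
        "check_utilization": f["check_rate"] / peak,
    }


if __name__ == "__main__":
    sample = ("   6547 iteration. Passed 30.0083 seconds  written:10766.2GB "
              "758.7GB/sec        checked:12919.5GB 816.7GB/sec")
    for key, value in summarize(sample).items():
        print(f"{key}: {value:.4f}")
```

Under that assumed peak, the first line works out to roughly 81% for writes and 87% for checks. A different reference bandwidth would shift both percentages.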

galkinvv commented 1 year ago

In short: this is a known problem that is not easy to fix without sacrificing other goals.

Here is a more detailed description. The tool was initially targeted at testing repaired or used hardware for correctness, and such hardware tends to have obscure problems that arise only under special load patterns, not under "just full bandwidth utilization". So memtest_vulkan tries hard to achieve a memory access pattern that tests not only the data transfer between the GPU and memory, but also makes the switching between accessed memory addresses quite random and unpredictable. Reading purely random addresses would be very slow, so the testing is done in a balanced mode: semi-random reads of small consecutive blocks. I haven't checked with a profiler, but it seems that calculating those addresses consumes a lot of GPU compute power, so the tool becomes slightly compute-bound and fails to utilize all of the theoretical bandwidth.

As a result, it tests with intensive switching of accessed addresses and carefully checks every value read back, but doesn't achieve maximal heating and power draw.

I have some ideas for utilizing more bandwidth while keeping complex access patterns, but for now they are only at the "planning" stage.
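To illustrate the "semi-random reads of small consecutive blocks" idea described above, here is a minimal CPU-side sketch. It is not the tool's actual shader code: the function name `semi_random_addresses`, the affine block permutation, and the constants are all assumptions chosen for illustration. Blocks are visited in a scrambled order while the words inside each block are read consecutively, and the extra arithmetic per block hints at why address calculation can cost compute time.

```python
from math import gcd


def semi_random_addresses(total_words, block_words, multiplier=40503, offset=12345):
    """Yield every word index exactly once.

    Blocks are visited in a scrambled order (an affine permutation, used here
    only as a simple stand-in), and words inside a block are consecutive.
    The multiplier and offset defaults are arbitrary illustrative constants.
    """
    if block_words <= 0 or total_words <= 0 or total_words % block_words != 0:
        raise ValueError("total_words must be a positive multiple of block_words")
    block_count = total_words // block_words

    # i -> (step * i + offset) % block_count is a bijection when step and
    # block_count are coprime, so no block is skipped or visited twice.
    step = multiplier % block_count or 1
    while gcd(step, block_count) != 1:
        step += 1

    for i in range(block_count):
        block = (step * i + offset) % block_count  # scrambled jump between blocks
        base = block * block_words
        for word in range(block_words):            # consecutive reads inside a block
            yield base + word


if __name__ == "__main__":
    # 64 words split into 16 blocks of 4 words each.
    print(list(semi_random_addresses(64, 4))[:12])
```

In a sketch like this, a smaller `block_words` means more jumps between distant addresses and lower throughput, while a larger one approaches plain sequential access. That is the balance being described.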

tsg2k2 commented 1 year ago

Thanks for the detailed explanation. I have a 3090 that fails under heavy load with the event log entry nvlddmkm event 0.

My current suspicion is bad VRAM, but your tool detects nothing. I'm assuming this is because it doesn't stress the VRAM enough.


GpuZelenograd / memtest_vulkan

Tool doesn't utilize bandwidth fully #2