GpuZelenograd / memtest_vulkan

Vulkan compute tool for testing video memory stability
https://github.com/GpuZelenograd/memtest_vulkan/blob/main/Readme.md
zlib License

No GTT->VRAM unswapping with amdgpu + memtest_vulkan #10

Open T-X opened 1 year ago

T-X commented 1 year ago

While looking into some performance issues in games when VRAM is overallocated, I initially reported this issue here and used some scripting around your super useful memtest_vulkan tool (thanks again!) to benchmark it. The response I got on the amd-gfx kernel mailing list was that amdgpu should be able to move memory from GTT back to VRAM.

So I was wondering: is there maybe something memtest_vulkan in particular (or Vulkan in general?) is doing that might hinder moving memory back from GTT to VRAM in my benchmark runs with memtest_vulkan? Specifically, test 5 and especially test 6 in the benchmarks I linked above seem to show unexpectedly low performance.

(Just asking in case you have an idea. It still feels more like an amdgpu issue (bug?) to me :smile:. I am also mentioning it in case there are features that could be interesting to add to memtest_vulkan itself, which would have simplified the benchmarking script or avoided those SIGSTOP'ing steps.)

galkinvv commented 1 year ago

About memory types: memtest_vulkan performs all allocations at start. It allocates one small buffer in a CPU+GPU-visible memory type for control info and error counting, and one huge buffer for memory testing in a memory type accessible only from the GPU. It never reallocates memory, which may be the reason.

I performed a test similar to yours and came to the same result: while memtest_vulkan instances are running with all VRAM used, the next instance allocates its memory in the GTT area and does not migrate back to VRAM even after the earlier instances are stopped or have exited. radeontop shows that VRAM is freed but GTT is still full. The result is the same with the motherboard BIOS option "Decoding over 4GB" enabled (all VRAM is visible through a large PCIe BAR) and disabled (only a 256MB PCIe BAR is directly visible). The heaps output below is for the large PCIe BAR, since it simplifies the memory heaps.
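The radeontop observation can also be cross-checked without extra tools, since the amdgpu kernel driver exposes VRAM and GTT usage counters as `mem_info_*` files in sysfs. Below is a minimal Python sketch of reading them. The function names are hypothetical, and the device directory (for example `/sys/class/drm/card0/device`) is an assumption, since the card index differs between systems.

```python
from pathlib import Path

# Counter files exposed by the amdgpu driver in the device's sysfs directory.
COUNTERS = (
    "mem_info_vram_used",
    "mem_info_vram_total",
    "mem_info_gtt_used",
    "mem_info_gtt_total",
)


def read_amdgpu_mem(device_dir):
    """Return the VRAM/GTT counters (in bytes) found in device_dir."""
    device = Path(device_dir)
    return {name: int((device / name).read_text().strip()) for name in COUNTERS}


def format_usage(stats):
    """Format the counters as a one-line summary in GiB."""
    gib = 1024 ** 3
    return "VRAM {:.2f}/{:.2f} GiB, GTT {:.2f}/{:.2f} GiB".format(
        stats["mem_info_vram_used"] / gib,
        stats["mem_info_vram_total"] / gib,
        stats["mem_info_gtt_used"] / gib,
        stats["mem_info_gtt_total"] / gib,
    )
```

Calling `format_usage(read_amdgpu_mem("/sys/class/drm/card0/device"))` before and after stopping the earlier instances should show whether GTT usage drops and VRAM usage rises, which is what a migration back to VRAM would look like.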

memtest_vulkan selects memory types explicitly, which can be seen in verbose mode. To enable verbose output, copy or symlink memtest_vulkan to memtest_vulkan_verbose and run that.

When run without parameters, it outputs the list of heaps at the start. Here is the output for an RX 6700 XT 12GB GPU:

[user@host ~]$ VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./memtest_vulkan_verbose
https://github.com/GpuZelenograd/memtest_vulkan v0.5.1 by GpuZelenograd
To finish testing use Ctrl+C
Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.235
...
1: Bus=0x03:00 DevId=0x73DF API 1.3.224  v22(0x5802003)  12GB AMD Radeon RX 6700 XT (RADV NAVI22)
Loading memory info for selected device index 0...
heap size  3.8GB budget  3.8GB usage  0.0GB flags=(empty)
heap size 12.0GB budget 12.0GB usage  0.0GB flags=DEVICE_LOCAL

The first heap seems to be GTT-related; the second one seems to be the real VRAM.

When run with GPU index and size parameters, it outputs the memory types present and the memory type selected. The output here is for exactly that last instance that performed very slowly (however, the selected indices are identical to the memory types used by the first "fast" run).

[user@host ~]$ VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./memtest_vulkan_verbose 1 4000000000
Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.235
Loading memory info for selected device index 0...
 0 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 1 } 
 1 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 1 } 
 2 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT, heap_index: 0 } 
 3 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, heap_index: 1 } 
 4 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED, heap_index: 0 } 
 5 MemoryType { property_flags: DEVICE_LOCAL | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
 6 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } 
 7 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
 8 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } 
CoherentIO memory          type 3 inside heap MemoryHeap { size: 12868124672, flags: DEVICE_LOCAL }
Trying   3.725GB buffer...
Test memory size   3.7GB   type  0: MemoryType { property_flags: DEVICE_LOCAL, heap_index: 1 } MemoryHeap { size: 12868124672, flags: DEVICE_LOCAL }
Standard 5-minute test of 1: Bus=0x03:00 DevId=0x73DF API 1.3.224  v22(0x5802003)  12GB AMD Radeon RX 6700 XT (RADV NAVI22)

This log shows that both memory types used, type 0 and type 3, belong to heap 1, not heap 0. So memtest_vulkan explicitly requests VRAM, not GTT.
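To make the reading of the log concrete, here is a small Python sketch that models the memory type table printed above and applies the common Vulkan idiom of taking the first memory type whose property flags contain all required flags. This is not memtest_vulkan's actual selection code: the table is copied from the log, while the `find_memory_type` helper and its first-match rule are assumptions made for illustration.

```python
# Memory types copied from the verbose log: index -> (property flags, heap index).
MEMORY_TYPES = [
    ({"DEVICE_LOCAL"}, 1),
    ({"DEVICE_LOCAL"}, 1),
    ({"HOST_VISIBLE", "HOST_COHERENT"}, 0),
    ({"DEVICE_LOCAL", "HOST_VISIBLE", "HOST_COHERENT"}, 1),
    ({"HOST_VISIBLE", "HOST_COHERENT", "HOST_CACHED"}, 0),
    ({"DEVICE_LOCAL", "DEVICE_COHERENT_AMD", "DEVICE_UNCACHED_AMD"}, 1),
    ({"HOST_VISIBLE", "HOST_COHERENT", "DEVICE_COHERENT_AMD",
      "DEVICE_UNCACHED_AMD"}, 0),
    ({"DEVICE_LOCAL", "HOST_VISIBLE", "HOST_COHERENT", "DEVICE_COHERENT_AMD",
      "DEVICE_UNCACHED_AMD"}, 1),
    ({"HOST_VISIBLE", "HOST_COHERENT", "HOST_CACHED", "DEVICE_COHERENT_AMD",
      "DEVICE_UNCACHED_AMD"}, 0),
]

# Heap flags from the heap listing: heap 0 has no flags, heap 1 is DEVICE_LOCAL.
HEAP_FLAGS = [set(), {"DEVICE_LOCAL"}]


def find_memory_type(required):
    """Return (type index, heap index) of the first type containing all
    required flags, or None if no type matches (hypothetical helper)."""
    for index, (flags, heap) in enumerate(MEMORY_TYPES):
        if required <= flags:
            return index, heap
    return None


# Test buffer: GPU-only memory.
test_type = find_memory_type({"DEVICE_LOCAL"})
# Control buffer: visible to both CPU and GPU.
io_type = find_memory_type({"DEVICE_LOCAL", "HOST_VISIBLE", "HOST_COHERENT"})
```

Under these assumptions the helper picks type 0 for the test buffer and type 3 for the control buffer, both in heap 1, which is the only heap flagged DEVICE_LOCAL. That matches the indices reported in the log.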

Note: I tried the same experiment with the amdvlk driver (passing VK_DRIVER_FILES=/usr/share/vulkan/icd.d/amd_icd64.json), and while trying to run the overallocating memtest_vulkan instance, Arch Linux with the 6.1.1 kernel hung with "kernel NULL pointer dereference 0000000000000038" somewhere in "ttm_bo_set_bulk_move+0x41/0x80".

(The test system I used for these experiments has only 8GB of RAM, which may not be enough to correctly handle overallocation on a 12GB GPU.)

About features of memtest_vulkan: by default it is designed to "allocate not too much memory, to avoid driver issues arising when nearing the VRAM limit, since both Linux and Windows may misbehave in such a situation".

galkinvv commented 1 year ago

Since the amd-gfx answer was "It will get swapped back to VRAM if it makes sense for performance.", it is interesting to see how "makes sense for performance" is determined.

Since memtest_vulkan uses one very huge buffer and a quite low count of command submissions (by design), the criterion

"The driver throttles swapping if there is too much contention to avoid the overhead of swapping large amounts of memory back and forth between vram and gtt for every command submission."

may throttle it too much. If the criterion is in source code and not inside the GPU, it would be interesting to see the source code for it.