GpuZelenograd / memtest_vulkan

Vulkan compute tool for testing video memory stability
https://github.com/GpuZelenograd/memtest_vulkan/blob/main/Readme.md
zlib License

AMD GPU wants "deviceCoherentMemory" to be enabled #1

Closed TheJackiMonster closed 1 year ago

TheJackiMonster commented 1 year ago

I tested release v0.4.0 on Arch Linux. It seems to work fine, but when I start it the following validation error is printed:

Validation Error: [ VUID-vkAllocateMemory-deviceCoherentMemory-02790 ] Object 0: handle = 0x56198c13d100, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x8830dc95 | vkAllocateMemory: attempting to allocate memory type 5, which includes the VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD memory property, but the deviceCoherentMemory feature is not enabled. The Vulkan spec states: If the deviceCoherentMemory feature is not enabled, pAllocateInfo->memoryTypeIndex must not identify a memory type supporting VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAllocateMemory-deviceCoherentMemory-02790)

My GPU is an RX 5700, running the RADV Vulkan driver from the Mesa project. This seems to be an AMD-related issue, because the message says that a feature from an AMD-specific extension (VK_AMD_device_coherent_memory) must be enabled before device-coherent memory can be used. Apart from this I don't get any validation errors.
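
The rule quoted in the validation message can be restated as a small predicate. This is only an illustrative sketch: the function name is hypothetical and the code is not taken from memtest_vulkan or from the validation layers. The constant is the bit value of VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD from the Vulkan headers.

```rust
// Bit value of VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD in the Vulkan headers.
const DEVICE_COHERENT_BIT_AMD: u32 = 0x40;

/// Hypothetical helper restating VUID-vkAllocateMemory-deviceCoherentMemory-02790:
/// allocating from a memory type that carries DEVICE_COHERENT_BIT_AMD is only
/// valid when the deviceCoherentMemory feature was enabled at device creation.
fn allocation_violates_02790(type_flags: u32, device_coherent_memory_enabled: bool) -> bool {
    !device_coherent_memory_enabled && (type_flags & DEVICE_COHERENT_BIT_AMD) != 0
}
```

Under this reading an application has two ways out: enable the feature, or avoid every memory type that carries the bit.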

Hope this helps!

galkinvv commented 1 year ago

Thanks for reporting.

Unfortunately I don't have access to an RX 5700 to test the issue directly, but it seems that a memory type carrying the DEVICE_COHERENT_BIT_AMD flag is being requested even though the test doesn't actually need it. From the application's point of view this is just "some unused flag from some unused extension". As far as I know, Vulkan extensions are designed so that an application which knows nothing about an extension and didn't request it may ignore it completely. This issue seems to break that design, and I really don't see a single obvious way to fix the validation error.

So I've prepared a change that first tries to select a memory type with exactly DEVICE_LOCAL|HOST_VISIBLE|HOST_COHERENT and no other flags, and falls back to memory types with extra flags set only if no such type is present.

The updated linux binaries are available as artifacts at https://github.com/GpuZelenograd/memtest_vulkan/actions/runs/3293700757

You can run ./memtest_vulkan from x86_64 archive to test if the validation error is gone.

However, I'm not sure which memory types are advertised on RADV + RX 5700, so I also added a listing of memory types in verbose mode. Renaming the binary to ./memtest_vulkan_verbose enables verbose output, which contains a lot of information, including the list of memory heaps and memory types.

For an RX 580 with RADV it looks like this:

heap size  7.8GB budget  7.7GB usage  0.0GB flags=DEVICE_LOCAL
heap size  3.8GB budget  3.8GB usage  0.0GB flags=(empty)
heap size  0.2GB budget  0.2GB usage  0.0GB flags=DEVICE_LOCAL
......
 0 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 0 }
 1 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 0 }
 2 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT, heap_index: 1 }
 3 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, heap_index: 2 }
 4 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED, heap_index: 1 }

So I'd appreciate it if you could post the verbose output of version 0.4.1 on the RX 5700.

TheJackiMonster commented 1 year ago

Okay, I still get the validation layer output from before:

1: Bus=0x09:00 DevId=0x731F   8GB AMD Radeon RX 5700 (RADV NAVI10)
Validation Error: [ VUID-vkAllocateMemory-deviceCoherentMemory-02790 ] Object 0: handle = 0x55d362e1d180, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x8830dc95 | vkAllocateMemory: attempting to allocate memory type 7, which includes the VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD memory property, but the deviceCoherentMemory feature is not enabled. The Vulkan spec states: If the deviceCoherentMemory feature is not enabled, pAllocateInfo->memoryTypeIndex must not identify a memory type supporting VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAllocateMemory-deviceCoherentMemory-02790)

Here's the verbose output without the iterations:

Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.226
Available: VK_LAYER_RENDERDOC_Capture, VK_LAYER_VALVE_steam_fossilize_64, VK_LAYER_VALVE_steam_overlay_32, VK_LAYER_VALVE_steam_fossilize_32, VK_LAYER_NV_nomad_release_public_2020_1_0, VK_LAYER_VALVE_steam_overlay_64, VK_LAYER_MANGOHUD_overlay, VK_LAYER_MESA_device_select, VK_LAYER_LUNARG_screenshot, VK_LAYER_LUNARG_api_dump, VK_LAYER_LUNARG_monitor, VK_LAYER_KHRONOS_validation, VK_LAYER_MESA_overlay, VK_LAYER_INTEL_nullhw
Extensions: VK_KHR_device_group_creation, VK_KHR_display, VK_KHR_external_fence_capabilities, VK_KHR_external_memory_capabilities, VK_KHR_external_semaphore_capabilities, VK_KHR_get_display_properties2, VK_KHR_get_physical_device_properties2, VK_KHR_get_surface_capabilities2, VK_KHR_surface, VK_KHR_surface_protected_capabilities, VK_KHR_wayland_surface, VK_KHR_xcb_surface, VK_KHR_xlib_surface, VK_EXT_acquire_drm_display, VK_EXT_acquire_xlib_display, VK_EXT_debug_report, VK_EXT_debug_utils, VK_EXT_direct_mode_display, VK_EXT_display_surface_counter, VK_KHR_portability_enumeration

linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  

1: Bus=0x09:00 DevId=0x731F API v.1.3.224  8GB AMD Radeon RX 5700 (RADV NAVI10)
heap size  7.8GB budget  7.5GB usage  0.0GB flags=DEVICE_LOCAL
heap size 15.6GB budget 15.6GB usage  0.0GB flags=(empty)
heap size  0.2GB budget  0.1GB usage  0.0GB flags=DEVICE_LOCAL
Spawned child Child { stdin: None, stdout: None, stderr: None, .. } with PID 2341
Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.226
Available: VK_LAYER_RENDERDOC_Capture, VK_LAYER_VALVE_steam_fossilize_64, VK_LAYER_VALVE_steam_overlay_32, VK_LAYER_VALVE_steam_fossilize_32, VK_LAYER_NV_nomad_release_public_2020_1_0, VK_LAYER_VALVE_steam_overlay_64, VK_LAYER_MANGOHUD_overlay, VK_LAYER_MESA_device_select, VK_LAYER_LUNARG_screenshot, VK_LAYER_LUNARG_api_dump, VK_LAYER_LUNARG_monitor, VK_LAYER_KHRONOS_validation, VK_LAYER_MESA_overlay, VK_LAYER_INTEL_nullhw
Extensions: VK_KHR_device_group_creation, VK_KHR_display, VK_KHR_external_fence_capabilities, VK_KHR_external_memory_capabilities, VK_KHR_external_semaphore_capabilities, VK_KHR_get_display_properties2, VK_KHR_get_physical_device_properties2, VK_KHR_get_surface_capabilities2, VK_KHR_surface, VK_KHR_surface_protected_capabilities, VK_KHR_wayland_surface, VK_KHR_xcb_surface, VK_KHR_xlib_surface, VK_EXT_acquire_drm_display, VK_EXT_acquire_xlib_display, VK_EXT_debug_report, VK_EXT_debug_utils, VK_EXT_direct_mode_display, VK_EXT_display_surface_counter, VK_KHR_portability_enumeration

linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
Inserted device layer VK_LAYER_KHRONOS_validation (libVkLayer_khronos_validation.so)
Failed to find vkGetDeviceProcAddr in layer libVkLayer_MESA_device_select.so
       Using "AMD Radeon RX 5700 (RADV NAVI10)" with driver: "/usr/lib/libvulkan_radeon.so"

 0 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 0 } 
 1 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 0 } 
 2 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT, heap_index: 1 } 
 3 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, heap_index: 2 } 
 4 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED, heap_index: 1 } 
 5 MemoryType { property_flags: DEVICE_LOCAL | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } 
 6 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
 7 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 2 } 
 8 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
CoherentIO memory          type 3 inside heap MemoryHeap { size: 268435456, flags: DEVICE_LOCAL }
Trying   7.137GB buffer...
Validation Error: [ VUID-vkAllocateMemory-deviceCoherentMemory-02790 ] Object 0: handle = 0x55e5a1cee590, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x8830dc95 | vkAllocateMemory: attempting to allocate memory type 5, which includes the VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD memory property, but the deviceCoherentMemory feature is not enabled. The Vulkan spec states: If the deviceCoherentMemory feature is not enabled, pAllocateInfo->memoryTypeIndex must not identify a memory type supporting VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAllocateMemory-deviceCoherentMemory-02790)
Test memory size   7.1GB   type  5: MemoryType { property_flags: DEVICE_LOCAL | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } MemoryHeap { size: 8321499136, flags: DEVICE_LOCAL }
Testing 1: Bus=0x09:00 DevId=0x731F API v.1.3.224  8GB AMD Radeon RX 5700 (RADV NAVI10)

I've also tested whether anything changes when resizable BAR is turned on in the BIOS. The output looks like this:

Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.226
Available: VK_LAYER_RENDERDOC_Capture, VK_LAYER_VALVE_steam_fossilize_64, VK_LAYER_VALVE_steam_overlay_32, VK_LAYER_VALVE_steam_fossilize_32, VK_LAYER_NV_nomad_release_public_2020_1_0, VK_LAYER_VALVE_steam_overlay_64, VK_LAYER_MANGOHUD_overlay, VK_LAYER_MESA_device_select, VK_LAYER_LUNARG_screenshot, VK_LAYER_LUNARG_api_dump, VK_LAYER_LUNARG_monitor, VK_LAYER_KHRONOS_validation, VK_LAYER_MESA_overlay, VK_LAYER_INTEL_nullhw
Extensions: VK_KHR_device_group_creation, VK_KHR_display, VK_KHR_external_fence_capabilities, VK_KHR_external_memory_capabilities, VK_KHR_external_semaphore_capabilities, VK_KHR_get_display_properties2, VK_KHR_get_physical_device_properties2, VK_KHR_get_surface_capabilities2, VK_KHR_surface, VK_KHR_surface_protected_capabilities, VK_KHR_wayland_surface, VK_KHR_xcb_surface, VK_KHR_xlib_surface, VK_EXT_acquire_drm_display, VK_EXT_acquire_xlib_display, VK_EXT_debug_report, VK_EXT_debug_utils, VK_EXT_direct_mode_display, VK_EXT_display_surface_counter, VK_KHR_portability_enumeration

linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  

1: Bus=0x09:00 DevId=0x731F API v.1.3.224  8GB AMD Radeon RX 5700 (RADV NAVI10)
heap size 15.6GB budget 15.6GB usage  0.0GB flags=(empty)
heap size  8.0GB budget  7.6GB usage  0.0GB flags=DEVICE_LOCAL
Spawned child Child { stdin: None, stdout: None, stderr: None, .. } with PID 3245
Verbose feature enabled (or 'verbose' found in name). Vulkan instance 1.3.226
Available: VK_LAYER_RENDERDOC_Capture, VK_LAYER_VALVE_steam_fossilize_64, VK_LAYER_VALVE_steam_overlay_32, VK_LAYER_VALVE_steam_fossilize_32, VK_LAYER_NV_nomad_release_public_2020_1_0, VK_LAYER_VALVE_steam_overlay_64, VK_LAYER_MANGOHUD_overlay, VK_LAYER_MESA_device_select, VK_LAYER_LUNARG_screenshot, VK_LAYER_LUNARG_api_dump, VK_LAYER_LUNARG_monitor, VK_LAYER_KHRONOS_validation, VK_LAYER_MESA_overlay, VK_LAYER_INTEL_nullhw
Extensions: VK_KHR_device_group_creation, VK_KHR_display, VK_KHR_external_fence_capabilities, VK_KHR_external_memory_capabilities, VK_KHR_external_semaphore_capabilities, VK_KHR_get_display_properties2, VK_KHR_get_physical_device_properties2, VK_KHR_get_surface_capabilities2, VK_KHR_surface, VK_KHR_surface_protected_capabilities, VK_KHR_wayland_surface, VK_KHR_xcb_surface, VK_KHR_xlib_surface, VK_EXT_acquire_drm_display, VK_EXT_acquire_xlib_display, VK_EXT_debug_report, VK_EXT_debug_utils, VK_EXT_direct_mode_display, VK_EXT_display_surface_counter, VK_KHR_portability_enumeration

linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
linux_read_sorted_physical_devices:
     Original order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)
     Sorted order:
           [0] AMD Radeon RX 5700 (RADV NAVI10)  
Inserted device layer VK_LAYER_KHRONOS_validation (libVkLayer_khronos_validation.so)
Failed to find vkGetDeviceProcAddr in layer libVkLayer_MESA_device_select.so
       Using "AMD Radeon RX 5700 (RADV NAVI10)" with driver: "/usr/lib/libvulkan_radeon.so"

 0 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 1 } 
 1 MemoryType { property_flags: DEVICE_LOCAL, heap_index: 1 } 
 2 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT, heap_index: 0 } 
 3 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, heap_index: 1 } 
 4 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED, heap_index: 0 } 
 5 MemoryType { property_flags: DEVICE_LOCAL | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
 6 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } 
 7 MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } 
 8 MemoryType { property_flags: HOST_VISIBLE | HOST_COHERENT | HOST_CACHED | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 0 } 
CoherentIO memory          type 3 inside heap MemoryHeap { size: 8573157376, flags: DEVICE_LOCAL }
Trying   7.162GB buffer...
Validation Error: [ VUID-vkAllocateMemory-deviceCoherentMemory-02790 ] Object 0: handle = 0x55c79d04a560, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x8830dc95 | vkAllocateMemory: attempting to allocate memory type 7, which includes the VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD memory property, but the deviceCoherentMemory feature is not enabled. The Vulkan spec states: If the deviceCoherentMemory feature is not enabled, pAllocateInfo->memoryTypeIndex must not identify a memory type supporting VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAllocateMemory-deviceCoherentMemory-02790)
Test memory size   7.2GB   type  7: MemoryType { property_flags: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD, heap_index: 1 } MemoryHeap { size: 8573157376, flags: DEVICE_LOCAL }
Testing 1: Bus=0x09:00 DevId=0x731F API v.1.3.224  8GB AMD Radeon RX 5700 (RADV NAVI10)

I hope this helps!

galkinvv commented 1 year ago

Thanks for providing the log! Now I see that there are a lot of memory types with DEVICE_*_AMD flags.

The test uses two memory allocations: one huge allocation for the actual testing, and a small allocation accessible from both CPU and GPU for their communication through the PCIe BAR. The list of memory types shows that both allocations must avoid accidentally using unknown memory flags, and the first attempt at a fix touched only one of them.

Here are the new artifacts: https://github.com/GpuZelenograd/memtest_vulkan/actions/runs/3304856053 I plan to publish the 0.4.1 release if it fixes the validation issue.
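
For readers following along, here is a minimal sketch of the "exact flags first, extra flags only as a fallback" selection applied to the nine memory types from the RX 5700 log above. It is not the actual memtest_vulkan code: the helper names are hypothetical, memory types are reduced to plain bitmasks, and the assumption that the large test buffer asks for DEVICE_LOCAL alone is mine. The flag constants use the bit values from the Vulkan headers.

```rust
// Memory property bit values as defined in the Vulkan headers.
const DEVICE_LOCAL: u32 = 0x01;
const HOST_VISIBLE: u32 = 0x02;
const HOST_COHERENT: u32 = 0x04;
const HOST_CACHED: u32 = 0x08;
const DEVICE_COHERENT_AMD: u32 = 0x40;
const DEVICE_UNCACHED_AMD: u32 = 0x80;

/// Hypothetical helper: returns the index of the memory type to allocate from.
/// First pass: a type whose flags are exactly `wanted`, so no unknown
/// extension bits come along. Second pass (fallback): the first type whose
/// flags merely contain `wanted`.
fn pick_memory_type(types: &[u32], wanted: u32) -> Option<usize> {
    types
        .iter()
        .position(|&flags| flags == wanted)
        .or_else(|| types.iter().position(|&flags| flags & wanted == wanted))
}

/// Property flags of memory types 0..=8 as listed in the RX 5700 verbose log.
fn rx5700_memory_types() -> Vec<u32> {
    let amd = DEVICE_COHERENT_AMD | DEVICE_UNCACHED_AMD;
    vec![
        DEVICE_LOCAL,
        DEVICE_LOCAL,
        HOST_VISIBLE | HOST_COHERENT,
        DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT,
        HOST_VISIBLE | HOST_COHERENT | HOST_CACHED,
        DEVICE_LOCAL | amd,
        HOST_VISIBLE | HOST_COHERENT | amd,
        DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | amd,
        HOST_VISIBLE | HOST_COHERENT | HOST_CACHED | amd,
    ]
}
```

With this table, `pick_memory_type(&rx5700_memory_types(), DEVICE_LOCAL)` yields type 0 and a request for `DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT` yields type 3, so neither allocation would touch types 5 to 8. The fallback pass only matters on drivers that advertise no exact match.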


P.S. According to the logs, enabling resizable BAR merges the "huge entire device-local heap" with the "part of the device-local heap accessible from the CPU".

Separate 7.8 and 0.2GB heaps with small BAR size:

heap size  7.8GB budget  7.5GB usage  0.0GB flags=DEVICE_LOCAL
heap size 15.6GB budget 15.6GB usage  0.0GB flags=(empty)
heap size  0.2GB budget  0.1GB usage  0.0GB flags=DEVICE_LOCAL

Combined 8.0 GB heap with BAR-resized-to-entire memory:

heap size 15.6GB budget 15.6GB usage  0.0GB flags=(empty)
heap size  8.0GB budget  7.6GB usage  0.0GB flags=DEVICE_LOCAL

This is mostly expected. However, I had assumed that, without resizable BAR, the 7.8GB of the largest heap included the 0.2GB CPU-visible area as a "sub-heap". According to these results it is completely separate, since the sizes are summed once resizable BAR is enabled. That isn't a problem for GPUs with 2GB or more, but for extremely low-end yet modern GPUs such as the GCN3 Radeon 530 with 1GB, losing 256MB from the main heap is a huge loss. Fortunately, such GPUs are extremely rare.
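
The "summed" observation can be checked against the exact byte sizes printed in the MemoryHeap lines of the two logs above. The sketch below only does that arithmetic; the constant and function names are made up for illustration.

```rust
const MIB: u64 = 1024 * 1024;

// Heap sizes in bytes, copied from the MemoryHeap lines in the logs above.
const MAIN_HEAP_SMALL_BAR: u64 = 8_321_499_136; // largest DEVICE_LOCAL heap, resizable BAR off
const CPU_VISIBLE_HEAP_SMALL_BAR: u64 = 268_435_456; // small DEVICE_LOCAL heap, resizable BAR off
const MERGED_HEAP_RESIZABLE_BAR: u64 = 8_573_157_376; // single DEVICE_LOCAL heap, resizable BAR on

/// Converts a byte count to whole MiB.
fn to_mib(bytes: u64) -> u64 {
    bytes / MIB
}

/// Total of the two separate device-local heaps: 7936 + 256 = 8192 MiB.
fn separate_heaps_total_mib() -> u64 {
    to_mib(MAIN_HEAP_SMALL_BAR + CPU_VISIBLE_HEAP_SMALL_BAR)
}

/// Difference between that total and the merged heap: 8192 - 8176 = 16 MiB.
fn missing_after_merge_mib() -> u64 {
    separate_heaps_total_mib() - to_mib(MERGED_HEAP_RESIZABLE_BAR)
}
```

So the two separate heaps add up to exactly 8192 MiB, and the merged heap reported with resizable BAR is 8176 MiB, which matches the sum to within 16 MiB. The logs do not say where those 16 MiB go.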

TheJackiMonster commented 1 year ago

It seems to be fixed. I don't get the validation error anymore. ^^

Good job!

galkinvv commented 1 year ago

The fix discussed here is included in the next release: https://github.com/GpuZelenograd/memtest_vulkan/releases/tag/v0.5.0