GpuZelenograd / memtest_vulkan

Vulkan compute tool for testing video memory stability
https://github.com/GpuZelenograd/memtest_vulkan/blob/main/Readme.md
zlib License

radv/amdgpu: Failed to allocate a buffer #16

Open jhumlick opened 1 year ago

jhumlick commented 1 year ago

I have a RX 7900 XTX, and it looks like not all of my memory is being tested. When I launch normally, I see:

1: Bus=0x09:00 DevId=0x744C 24GB AMD Radeon RX 7900 XTX (RADV GFX1100)
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 23927123968 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 23507693568 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 23088263168 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 22668832768 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 22249402368 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 21829971968 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 21410541568 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 20991111168 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 20571680768 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 20152250368 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 19732819968 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 19313389568 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 18893959168 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 18474528768 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 18055098368 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 17635667968 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 17216237568 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
radv/amdgpu: Failed to allocate a buffer:
radv/amdgpu: size : 16796807168 bytes
radv/amdgpu: alignment : 262144 bytes
radv/amdgpu: domains : 4
Standard 5-minute test of 1: Bus=0x09:00 DevId=0x744C 24GB AMD Radeon RX 7900 XTX (RADV GFX1100)
1 iteration. Passed 0.0310 seconds written: 11.2GB 864.3GB/sec checked: 15.0GB 832.2GB/sec
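
The pattern in the log is easier to see with a little arithmetic. The Python sketch below extracts the requested sizes from the driver messages and computes the step between consecutive attempts. The `failed_sizes` helper and the abbreviated `LOG` sample (first two and last failed request only) are illustrative and not part of memtest_vulkan.

```python
import re

# Abbreviated sample: the first two and the last failed request from the log above.
LOG = """\
radv/amdgpu: size : 23927123968 bytes
radv/amdgpu: size : 23507693568 bytes
radv/amdgpu: size : 16796807168 bytes
"""

def failed_sizes(log_text):
    """Return the byte sizes from 'size : N bytes' lines, in order of appearance."""
    return [int(m) for m in re.findall(r"size\s*:\s*(\d+) bytes", log_text)]

sizes = failed_sizes(LOG)
step = sizes[0] - sizes[1]  # difference between two consecutive requests
print(f"step between attempts: {step} bytes = {step // 2**20} MiB")
print(f"smallest failed request: {sizes[-1] / 2**30:.2f} GiB")
```

The requests shrink by 419430400 bytes (400 MiB) each time, so the 18 failures are consecutive retries at decreasing sizes, not 18 separate problems.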

I have resizable BAR turned on in my bios.

I will attach the output I see running with the file renamed to memtest_vulkan_verbose.

It also appears that the tool crashes when its output is piped through tee to write a log file (i.e. ./memtest_vulkan_verbose | tee memtest_vulkan_verbose.txt crashes when Ctrl+C is pressed).

Attachment: memtest_vulkan_verbose.txt

galkinvv commented 1 year ago

In short: thanks for reporting. This is known behavior and is mostly harmless. (I've noted the crash on Ctrl+C while piped as a separate issue #17).

Detailed: The cause of the problem is an incompatibility between the RADV driver and memtest_vulkan. memtest_vulkan insists on allocating a single contiguous buffer (this is a technical design choice), but RADV fails to allocate such a large buffer contiguously. However, the good news is that being able to test only 15.0GB out of 24.0GB is not a problem for nearly all usages, since

So for 99% of cases this is just fine. The "Failed to allocate a buffer" messages above are generated by the driver and, for memtest_vulkan's usage scenario, can simply be ignored. I can't make the driver less verbose, since this is the driver's unconditional stderr output.

Actually, some other drivers sometimes fail to allocate large contiguous buffers too. In that case memtest_vulkan silently auto-selects a slightly smaller size, and that is perfectly fine.
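
As a rough illustration of this "try a slightly smaller size" behaviour, here is a minimal Python sketch. It is not memtest_vulkan's actual code: `largest_allocatable` and `make_fake_allocator` are hypothetical names, the fake allocator stands in for the driver, and its limit is a made-up value chosen so that the run reproduces the 18 failed requests seen in the log. Only the starting size and the 400 MiB step are taken from the log.

```python
def largest_allocatable(try_alloc, start, step, minimum=0):
    """Try sizes start, start - step, ... and return the first that succeeds, or None."""
    size = start
    while size > minimum:
        if try_alloc(size):
            return size
        size -= step
    return None

def make_fake_allocator(limit):
    """Hypothetical stand-in for the driver: any request above `limit` bytes fails."""
    return lambda size: size <= limit

START = 23927123968  # first request seen in the log
STEP = 419430400     # 400 MiB, the difference between consecutive requests in the log
LIMIT = 16377376768  # assumption for the demo only, not a known driver limit

result = largest_allocatable(make_fake_allocator(LIMIT), START, STEP)
print(result)
```

With these values the loop fails 18 times, from 23927123968 down to 16796807168 bytes, before the next smaller request succeeds.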

As a partial fix, I plan to detect the RADV driver and apply some minor tweaks:

Also, there is a chance that the 16GB limit is specified in some limits exposed by the driver, but I'm not sure. Can you please install the package containing the vulkaninfo utility (something like vulkan-tools) and attach the output of VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json vulkaninfo ?

A side note: AMD GPUs on Linux have two different open-source Vulkan drivers that can be installed simultaneously without severe conflicts. So you can additionally install the AMDVLK driver and choose which driver to use with the VK_ICD_FILENAMES environment variable: amd_icd64 is AMDVLK, and radeon_icd.x86_64 is the RADV driver you are using now. (That variable name is the one for your Vulkan loader's libvulkan1 version; newer versions of the libvulkan1 loader renamed it to VK_DRIVER_FILES.) So, with the AMDVLK driver additionally installed, you can run

[user@host]$ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json ./memtest_vulkan
[user@host]$ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./memtest_vulkan

I have no AMD GPU with 16GB or more, but in my experience with a 12GB RX6700, AMDVLK allows testing 10.5GB (1.5GB is auto-skipped by memtest_vulkan to avoid a desktop lockup), while RADV limits it to 4GB. So I hope that on the RX 7900 AMDVLK will allow going above 16GB.

The same environment var applies to all other vulkan apps, including vulkaninfo
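
If you want to switch drivers from a script and not by typing the variable each time, something like the following Python sketch could work. The `env_for_vulkan_driver` helper is a made-up name. Setting both variable names to the same manifest is an assumption intended to cover older and newer loaders alike. The manifest paths are the ones from the commands above and may differ on other distributions.

```python
import os

def env_for_vulkan_driver(manifest_path, base_env=None):
    """Return a copy of the environment that points the Vulkan loader at one ICD manifest."""
    env = dict(os.environ if base_env is None else base_env)
    # Older loaders read VK_ICD_FILENAMES; newer ones read VK_DRIVER_FILES.
    # Setting both to the same value is an assumption meant to cover either case.
    env["VK_ICD_FILENAMES"] = manifest_path
    env["VK_DRIVER_FILES"] = manifest_path
    return env

RADV = "/usr/share/vulkan/icd.d/radeon_icd.x86_64.json"
AMDVLK = "/usr/share/vulkan/icd.d/amd_icd64.json"

# A small explicit base environment keeps the example deterministic.
env = env_for_vulkan_driver(AMDVLK, base_env={"PATH": "/usr/bin"})
print(env["VK_DRIVER_FILES"])
```

The resulting dictionary can be passed as the env argument of subprocess.run, for example subprocess.run(["./memtest_vulkan"], env=env), and the same works for vulkaninfo.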

jhumlick commented 1 year ago

Thanks for the detailed reply!

I had thought I had both drivers installed but upon further inspection, discovered that I only had the RADV driver installed. I had to do a lot of package jumbling in order to get versions of MESA and a kernel that would support this GPU, so I guess I lost the AMDVLK driver in the process. Once I installed the AMD driver again, memtest_vulkan prompted me to select which driver to use. The issue did not appear with the AMDVLK driver, and I was able to stress test my system for an hour or so without any issues.

Thanks again for your help!

Also, thanks for letting me know about the VK_ICD_FILENAMES and VK_DRIVER_FILES environment variables. I previously only knew that I could specify to use the RADV driver via AMD_VULKAN_ICD=RADV, back when I wasn't using a bleeding-edge MESA and kernel, and had both drivers installed. ;-)