KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

Failures and hangs for GeForce GTX 950 on Ubuntu 22.04 #4716

Closed lunarpapillo closed 10 months ago

lunarpapillo commented 1 year ago

This machine will replace an Ubuntu 16.04 machine with the same GPU.

8 failed tests

1 hanging test


For reference, on Ubuntu 16.04, most of these tests passed (two were skipped): http://erusea:8080/job/Vulkan-ValidationLayers/9191/BITS=64,BUILD_MODE=Release,USE_ROBIN_HOOD_HASHING=OFF,label=Aurelia-Linux-Nvidia/artifact/vulkantest-results/execution-logs/009-vk_layer_validation_tests-info.txt

...
[ RUN      ] VkLayerTest.TestBindBufferMemoryDeviceGroup
     TEST SKIPPED: Test requires a physical device group with more than 1 device.
[       OK ] VkLayerTest.TestBindBufferMemoryDeviceGroup (47 ms)
...
[ RUN      ] VkLayerTest.ValidateImportMemoryHandleType
[       OK ] VkLayerTest.ValidateImportMemoryHandleType (202 ms)
...
[ RUN      ] VkLayerTest.DuplicatePhysicalDevices
[       OK ] VkLayerTest.DuplicatePhysicalDevices (248 ms)
...
[ RUN      ] VkLayerTest.InvalidImageCreateFlagWithPhysicalDeviceCount
[       OK ] VkLayerTest.InvalidImageCreateFlagWithPhysicalDeviceCount (197 ms)
...
[ RUN      ] VkLayerTest.TransferImageToSwapchainWithInvalidLayoutDeviceGroup
[       OK ] VkLayerTest.TransferImageToSwapchainWithInvalidLayoutDeviceGroup (652 ms)
...
[ RUN      ] VkLayerTest.InvalidDeviceMask
[       OK ] VkLayerTest.InvalidDeviceMask (20315 ms)
...
[ RUN      ] VkPositiveLayerTest.TransferImageToSwapchainDeviceGroup
[       OK ] VkPositiveLayerTest.TransferImageToSwapchainDeviceGroup (384 ms)
...
[ RUN      ] VkPositiveLayerTest.ImagelessLayoutTracking
[       OK ] VkPositiveLayerTest.ImagelessLayoutTracking (312 ms)
...
[ RUN      ] VkLayerTest.PresentIdWait
Device extension VK_KHR_present_wait is not supported
             TEST SKIPPED: Error initializing extensions or retrieving features, skipping test
[       OK ] VkLayerTest.PresentIdWait (112 ms)
...

The full test logfile is attached; it could be useful for determining what went wrong with any particular test: blacklist.txt

spencer-lunarg commented 1 year ago

A lot of these have Unexpected: Validation Error: [ VUID-VkPhysicalDeviceGroupProperties-sType-sType ]

I ran into this as well. The cause is almost certainly (99%) an outdated VK_LAYER_MESA_device_select layer; updating the layer to the newest version should fix all of these, I think.

https://gitlab.freedesktop.org/mesa/mesa/-/commit/4588453815c58ec848b0ff6f18a08836e70f55df

spencer-lunarg commented 1 year ago

Also related: https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/4674. It seems NODEVICE_SELECT is one way to get around this (cc @juan-lunarg).

lunarpapillo commented 1 year ago

NODEVICE_SELECT=1 fixes some, but not all of the tests... the remaining two failures (on the latest code, which includes a fix for the hang in PresentIdWait) are:

I'll see about either getting a newer VK_LAYER_MESA_device_select, or talk to @juan-lunarg about getting NODEVICE_SELECT=1 set.

spencer-lunarg commented 1 year ago

For VkLayerTest.ValidateImportMemoryHandleType, it seems that either buffer_export.init_no_mem(*m_device, buffer_info); or memory_buffer_export.init(*m_device, alloc_info); is failing in the test, which causes vkBindBufferMemory to fail because a VK_NULL_HANDLE is passed in.

juan-lunarg commented 1 year ago

VkLayerTest.PresentIdWait appears to be a driver bug from Nvidia. I worked on this with Charles yesterday and it seems the extension is just broken on Linux.

lunarpapillo commented 1 year ago

it seems the extension is just broken on Linux.

on all Linux, or just NVIDIA Linux?

juan-lunarg commented 1 year ago

on all Linux, or just NVIDIA Linux?

NVIDIA Linux; not sure about other GPUs.

lunarpapillo commented 1 year ago

I just talked with @charles-lunarg and @juan-lunarg about how to handle this in CI. We discussed:

Looking for insights and other alternatives from the VVL developers...

jeremyg-lunarg commented 1 year ago

Are these problems bugs in VK_LAYER_MESA_device_select? If so could we file an Issue/PR to get them fixed?

In the short term, it seems like the 2nd or 3rd solution is 'best', since testing in a non-user configuration will eventually cause much confusion.

juan-lunarg commented 1 year ago

Are these problems bugs in VK_LAYER_MESA_device_select? If so could we file an Issue/PR to get them fixed?

At least for https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/4674 the issue has already been fixed in mesa

Not sure about all of these tests however.

lunarpapillo commented 1 year ago

The bugs have been fixed in some version of Mesa... but the fix hasn't propagated to the default in Ubuntu yet, and I'm not sure when it will.

But that does bring up a fourth alternative, to ensure that the VK_LAYER_MESA_device_select layer is at least some "known good" version on all CI systems... (someone remind me what that version number is, again?)... I'll edit the above to add the fourth.
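A "known good version" gate for CI could be sketched roughly as below: compare the installed version string against a minimum, field by field and numerically. The function names are made up for this example, and the thread does not state the actual known-good version, so any version strings passed in are placeholders. How the installed version is obtained (package manager, layer manifest, etc.) is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parses the leading decimal digits of a field; a distro suffix such as
// "5-0ubuntu0" yields 5, and a field with no leading digits yields 0.
int LeadingNumber(const std::string& field) {
    int value = 0;
    for (char c : field) {
        if (c < '0' || c > '9') break;
        value = value * 10 + (c - '0');
    }
    return value;
}

// Splits "22.0.5" into {22, 0, 5}.
std::vector<int> ParseVersion(const std::string& text) {
    std::vector<int> parts;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, '.')) {
        parts.push_back(LeadingNumber(field));
    }
    return parts;
}

// True if `installed` is the same as or newer than `minimum`.
// Missing trailing fields are treated as 0, so "23.0" equals "23.0.0".
bool VersionAtLeast(const std::string& installed, const std::string& minimum) {
    std::vector<int> a = ParseVersion(installed);
    std::vector<int> b = ParseVersion(minimum);
    const std::size_t n = std::max(a.size(), b.size());
    a.resize(n, 0);
    b.resize(n, 0);
    return !std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}
```

Comparing numerically per field matters here: a plain string comparison would rank "22.10.1" below "22.9.9".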

TonyBarbour commented 1 year ago

Assuming the device_select layer is there due to iGfx being present, would disabling iGfx from the bios fix the issue?



lunarpapillo commented 1 year ago

Assuming the device_select layer is there due to iGfx being present,

I don't think @johnzupin deliberately installed an Intel driver, so I'm thinking it would appear in any Ubuntu 22.04 installation... but John can hopefully provide more insight.

would disabling iGfx from the bios fix the issue?

Even if it did, I'd consider this a last resort. I don't think typical end users do this, and I'd like CI machines to reflect typical configuration whenever possible. I've seen this sort of fix lead to issues that end users see that cannot be detected in CI... and I'm hoping one day to expand CI to run against the Intel drivers too.

charles-lunarg commented 1 year ago

Disabling the graphics in the bios wouldn't fix the issue, unless the 'bad layer' is removed as a part of the driver installation/removal process. The layer is its own thing and I don't think the layer has any mechanism to check for bios level settings.

juan-lunarg commented 1 year ago

For VkLayerTest.ValidateImportMemoryHandleType, it seems that either buffer_export.init_no_mem(*m_device, buffer_info); or memory_buffer_export.init(*m_device, alloc_info); is failing in the test, which causes vkBindBufferMemory to fail because a VK_NULL_HANDLE is passed in.

memory_buffer_import init fails.

juan-lunarg commented 1 year ago

I verified that the PresentIdWait issue is definitely an Nvidia bug. It fails to pass ALL relevant CTS tests.

juan-lunarg commented 1 year ago

VkLayerTest.ValidateImportMemoryHandleType does look like an issue on our side. See PR https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull/4748

juan-lunarg commented 1 year ago

VkLayerTest.ValidateImportMemoryHandleType is now fixed.

juan-lunarg commented 1 year ago

Closing since CI has been updated: https://github.com/LunarG/VulkanTests/commit/bb6c84f4875439a67c51b75e58e5690303b8ce20

lunarpapillo commented 1 year ago

Waitaminnit... I thought these issues remained until all the blacklisted tests are either fixed or encoded into the internal VVL blacklist...? I show several tests still failing:

juan-lunarg commented 1 year ago

Waitaminnit... I thought these issues remained until all the blacklisted tests are either fixed or encoded into the internal VVL blacklist...? I show several tests still failing:

Apologies, I didn't understand the protocol. I made an incorrect assumption.

lunarpapillo commented 1 year ago

Unexpectedly, NODEVICE_SELECT=1 breaks Vulkan-ExtensionLayer tests and gfxreconstruct tests. These may indicate device-ordering dependencies in these repositories. Waiting to see if @jeremyg-lunarg or GFXR engineers have any insights.

If we can't set NODEVICE_SELECT=1 for all Linux CI, we can set it just for VVL.
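One way to scope the variable to a single test binary, rather than to the whole CI environment, is to set it inside the process before any Vulkan instance is created. The sketch below assumes a POSIX system and assumes the loader consults `NODEVICE_SELECT` when it enumerates implicit layers at instance creation; the helper name is hypothetical and is not part of VVL or the CI scripts.

```cpp
#include <cstdlib>   // std::getenv
#include <stdlib.h>  // setenv (POSIX)

// Hypothetical helper: exports NODEVICE_SELECT=1 for this process only.
// overwrite=0 leaves any value already exported by the caller untouched.
// Returns true if the call succeeded.
bool DisableMesaDeviceSelectForThisProcess() {
    return setenv("NODEVICE_SELECT", "1", /*overwrite=*/0) == 0;
}
```

It would need to run at the very top of the test program, before the first Vulkan call; child processes inherit the setting, but other CI jobs such as the Vulkan-ExtensionLayer or gfxreconstruct tests are unaffected.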

lunarpapillo commented 1 year ago

This change sets the variable just for VVL: https://github.com/LunarG/VulkanTests/pull/405

spencer-lunarg commented 10 months ago

I spent time on this last week; https://github.com/LunarG/VulkanTests/pull/483 should reset these to the ones I find are fixable.

Also, we fixed the NODEVICE_SELECT issue in https://github.com/LunarG/VulkanTests/pull/482