felixdoerre / primus_vk

Vulkan GPU-offloading layer
BSD 2-Clause "Simplified" License

Validation layers cause crashes #62

Closed: montoyo closed this issue 4 years ago

montoyo commented 4 years ago

Hi, First, I'd like to thank you for this awesome tool!

Recently I've been trying to write my own small renderer using Vulkan, so I have the validation layers (VK_LAYER_KHRONOS_validation) enabled. However, after creating the swapchain, the program runs for about half a second and then segfaults. The logs show the following validation errors:

vulkan: [ERROR  ] [VALIDATION] [ VUID-vkGetPhysicalDeviceSurfaceSupportKHR-queueFamilyIndex-01269 ] Object: 0x55ab3f0dbc10 (Type = 2) | vkGetPhysicalDeviceSurfaceSupportKHR: queueFamilyIndex (= 1) is not less than any previously obtained pQueueFamilyPropertyCount from vkGetPhysicalDeviceQueueFamilyProperties (i.e. is not less than 1). The Vulkan spec states: queueFamilyIndex must be less than pQueueFamilyPropertyCount returned by vkGetPhysicalDeviceQueueFamilyProperties for the given physicalDevice (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkGetPhysicalDeviceSurfaceSupportKHR-queueFamilyIndex-01269)
vulkan: [ERROR  ] [VALIDATION] [ VUID-vkGetPhysicalDeviceSurfaceSupportKHR-queueFamilyIndex-01269 ] Object: 0x55ab3f0dbc10 (Type = 2) | vkGetPhysicalDeviceSurfaceSupportKHR: queueFamilyIndex (= 2) is not less than any previously obtained pQueueFamilyPropertyCount from vkGetPhysicalDeviceQueueFamilyProperties (i.e. is not less than 1). The Vulkan spec states: queueFamilyIndex must be less than pQueueFamilyPropertyCount returned by vkGetPhysicalDeviceQueueFamilyProperties for the given physicalDevice (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkGetPhysicalDeviceSurfaceSupportKHR-queueFamilyIndex-01269)

At first glance, my code seems to be calling vkGetPhysicalDeviceSurfaceSupportKHR with an out-of-range queue family index. However, after a quick debugging session, I can say that pQueueFamilyPropertyCount is indeed equal to 3, and that vkGetPhysicalDeviceSurfaceSupportKHR returns valid values. So the validation layers should not report any error!

Now, what's rather odd is that if I just disable the validation layers, it works just fine! This is not specific to my program: running pvkrun vkcube also works fine, but pvkrun vkcube --validation crashes!

I haven't seen anything about this in the 'technical limitations' section of PrimusVK's readme, so I'm assuming it's a bug. I'm attaching some logs; hopefully they will be helpful.

Thanks!

my_prog.log my_prog_validation.log vkcube.log vkcube_validation.log misc_info.txt

felixdoerre commented 4 years ago

Yes, I guess this is bad behaviour on primus_vk's part. However, your current use case is validating your application rather than primus_vk itself, isn't it? The crash depends on the layer ordering, which you can set explicitly by activating both layers through VK_INSTANCE_LAYERS. With the following order you validate your application together with primus_vk; this segfaults for me as well:

VK_INSTANCE_LAYERS=VK_LAYER_PRIMUS_PrimusVK:VK_LAYER_KHRONOS_validation optirun vkcube

Here you validate the API calls that your application passes to primus_vk; this works for vkcube:

VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation:VK_LAYER_PRIMUS_PrimusVK optirun vkcube

You could try that as a workaround; the errors it reports will sometimes fit your application better. I'll debug primus_vk and check what goes wrong.
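To make the ordering point concrete, here is a minimal sketch in C++. It is a simplified model, not loader or primus_vk code: it only assumes that the first name in VK_INSTANCE_LAYERS is the layer closest to the application, and it covers only the case where both layers are listed in that variable. The names split_layers, what_is_validated, Validates and demo are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Split a VK_INSTANCE_LAYERS-style value on ':' (the separator used in the
// commands above). Index 0 is the layer closest to the application.
std::vector<std::string> split_layers(const std::string& value) {
    std::vector<std::string> layers;
    std::string::size_type start = 0;
    while (start <= value.size()) {
        std::string::size_type end = value.find(':', start);
        if (end == std::string::npos) end = value.size();
        if (end > start) layers.push_back(value.substr(start, end - start));
        start = end + 1;
    }
    return layers;
}

// Hypothetical classification of what the validation layer gets to see.
enum class Validates { Nothing, ApplicationCalls, PrimusCalls };

// If the validation layer is listed before primus_vk, it sits between the
// application and primus_vk and checks the application's own calls.
// If it is listed after primus_vk, it checks the calls primus_vk makes.
Validates what_is_validated(const std::string& value) {
    const std::vector<std::string> layers = split_layers(value);
    const auto validation =
        std::find(layers.begin(), layers.end(), "VK_LAYER_KHRONOS_validation");
    const auto primus =
        std::find(layers.begin(), layers.end(), "VK_LAYER_PRIMUS_PrimusVK");
    if (validation == layers.end()) return Validates::Nothing;
    if (primus == layers.end() || validation < primus)
        return Validates::ApplicationCalls;
    return Validates::PrimusCalls;
}

// Usage: the two orderings from the commands above.
void demo() {
    assert(what_is_validated(
               "VK_LAYER_PRIMUS_PrimusVK:VK_LAYER_KHRONOS_validation") ==
           Validates::PrimusCalls);
    assert(what_is_validated(
               "VK_LAYER_KHRONOS_validation:VK_LAYER_PRIMUS_PrimusVK") ==
           Validates::ApplicationCalls);
}
```

In this model, the first command above corresponds to PrimusCalls (the order that segfaults) and the second to ApplicationCalls (the order suggested as a workaround).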

felixdoerre commented 4 years ago

I think I have solved the problem for primus_vk. Could you test with the branch preview_queue_family_index ( https://github.com/felixdoerre/primus_vk/tree/preview_queue_family_index )?

felixdoerre commented 4 years ago

Just a few more thoughts as context: We have 3 queue families on the Nvidia card and 1 queue family on the Intel card. Primus_vk hands out the Nvidia card to your application, and the application correctly detects 3 queue families. When probing for surface support we cannot hand those requests to the Nvidia driver (as the Nvidia driver does not know about the surfaces on the Intel driver and would be totally confused if asked about them). Additionally, what we really need is support for the surface on the Intel driver. But when implementing this, I was lazy and passed the Nvidia queueFamilyIndex through directly, which does not make any sense. This caused the problems: you correctly queried for queue families 1 and 2, but the Intel driver does not have those. The validation layer reports this problem (and crashes on it). The Intel driver itself does not seem to have any problem with that, as it just seems to ignore the passed queueFamilyIndex.

So now primus_vk determines a display queueFamilyIndex for the Intel driver on startup and uses that instead of the passed queueFamilyIndex. And as we are copying everything through memory, we have no special requirements on the Nvidia queueFamilyIndex and just ignore that value.
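Roughly, the change can be sketched like this in C++. This is a simplified model and not the actual primus_vk code: QueueFamily, Device and all function names are hypothetical stand-ins, and no real Vulkan calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for one queue family; only the property relevant
// here (can it present to the surface?) is modelled.
struct QueueFamily {
    bool can_present;
};

// Simplified stand-in for a physical device: just its queue families.
struct Device {
    std::vector<QueueFamily> families;
};

// Done once at startup in this sketch: pick a queue family on the display
// (Intel) device that can present. Returns -1 if there is none.
int find_display_queue_family(const Device& display) {
    for (std::size_t i = 0; i < display.families.size(); ++i) {
        if (display.families[i].can_present) return static_cast<int>(i);
    }
    return -1;
}

// Old behaviour as described above: the application's index (which uses the
// render device's numbering) was forwarded unchanged. This helper answers
// the question the validation layer asks: is that index in range for the
// display device?
bool forwarded_index_is_valid(const Device& display, std::size_t app_index) {
    return app_index < display.families.size();
}

// New behaviour: the application's index is ignored and the query is
// answered with the display queue family chosen at startup.
bool surface_supported(const Device& display, int display_index,
                       std::size_t /*app_index, ignored*/) {
    assert(display_index >= 0);
    assert(static_cast<std::size_t>(display_index) < display.families.size());
    return display.families[static_cast<std::size_t>(display_index)].can_present;
}
```

With a display device that has a single presentable queue family, forwarded_index_is_valid is false for application indices 1 and 2, which matches the two validation errors in the report, while surface_supported answers from index 0 whatever index the application passes.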

montoyo commented 4 years ago

Ah, I didn't know the order of layers had an influence. Indeed, after swapping the layers, it worked both with my program and vkcube. Thanks!

I tried preview_queue_family_index. For some reason vkcube only worked one time out of 4. When it didn't work, vkCreateInstance would fail, telling me to make sure I had an ICD installed. When it did work, the validation errors were gone, but it would still segfault after a few seconds :/ In both cases, I didn't spot anything helpful in the logs; both GPUs were detected by PrimusVK, etc.

felixdoerre commented 4 years ago

Hi, ooh yes, the segfault is a few moments later... There was another error. I've updated preview_queue_family_index, and I've verified that vkcube now starts reliably for me without any validation errors, regardless of the layer order. Could you re-test as well, please?

montoyo commented 4 years ago

Works like a charm and no more validation errors! Awesome, thanks!

felixdoerre commented 4 years ago

I have pushed the fixes to master. I obviously neglected to test with the validation layers for quite some time, and the issues they detected could be the cause of some very hard-to-spot bugs. Thanks for reporting this.