KhronosGroup / Vulkan-Loader

Vulkan Loader
https://vulkan.lunarg.com/doc/sdk/latest/linux/LoaderInterfaceArchitecture.html
Other
520 stars 283 forks source link

Enabling extension in layer causes crash #51

Closed pjaholkowski closed 3 years ago

pjaholkowski commented 6 years ago

Way to reproduce

  1. Don't pass VK_KHR_get_physical_device_properties2 as instance extension while calling vkCreateInstance inside vulkan application
  2. Enable VK_KHR_get_physical_device_properties2 while calling vkCreateInstance inside implicit layer
  3. Inside implicit layer call vkGetPhysicalDeviceProperties2KHR it will cause AccessViolation exception inside loader.c in function VKAPI_ATTR void VKAPI_CALL terminator_GetPhysicalDeviceProperties2(VkPhysicalDevice physicalDevice, VkPhysicalDeviceProperties2 *pProperties)

It causes AccessViolation at

VKAPI_ATTR void VKAPI_CALL terminator_GetPhysicalDeviceProperties2(VkPhysicalDevice physicalDevice, VkPhysicalDeviceProperties2 pProperties) { struct loader_physical_device_term phys_dev_term = (struct loader_physical_device_term )physicalDevice; struct loader_icd_term icd_term = phys_dev_term->this_icd_term; const struct loader_instance *inst = icd_term->this_instance;

// Get the function pointer to use to call into the ICD. This could be the core or KHR version
PFN_vkGetPhysicalDeviceProperties2 fpGetPhysicalDeviceProperties2 = NULL;
if (inst != NULL && inst->enabled_known_extensions.khr_get_physical_device_properties2) {
    fpGetPhysicalDeviceProperties2 = icd_term->dispatch.GetPhysicalDeviceProperties2KHR;
} else {
    fpGetPhysicalDeviceProperties2 = icd_term->dispatch.GetPhysicalDeviceProperties2;
}

if (fpGetPhysicalDeviceProperties2 != NULL || !inst->enabled_known_extensions.khr_get_physical_device_properties2) {
    // Pass the call to the driver
    fpGetPhysicalDeviceProperties2(phys_dev_term->phys_dev, pProperties); // <- !!!! calling null pointer
} else {
    // Emulate the call
    loader_log(icd_term->this_instance, VK_DEBUG_REPORT_INFORMATION_BIT_EXT, 0,
               "vkGetPhysicalDeviceProperties2: Emulating call in ICD \"%s\" using vkGetPhysicalDeviceProperties",
               icd_term->scanned_icd->lib_name);

.....

because

fpGetPhysicalDeviceProperties2 == nullptr; inst->enabled_known_extensions.khr_get_physical_device_properties2 == false; icd_term->dispatch.GetPhysicalDeviceProperties2KHR != nullptr; icd_term->dispatch.GetPhysicalDeviceProperties2 == nullptr;

Full crash dump with dll versions at https://ufile.io/ix9jl problem doesn't exist on Intel and Nvidia GPUs

lenny-lunarg commented 6 years ago

I really have no idea if adding an extension to the list of enabled extensions is allowed or not. As far as I can tell, it isn't explicitly outlawed and the spec is pretty vague in this area. That makes me suspect that this should be allowed. However, the loader is definitely not written in a way that supports this.

The reason this crashes is that the loader is figuring out which instance extensions are enabled in the trampoline function for vkCreateInstance (which is called before the layers). As a result, the loader doesn't realize the extension is enabled and the behavior is very messed up when you try to call into any function from the extension. Fixing this problem would require reworking at least a little bit of the loader's logic because it just isn't set up to handle this right now.

But I'm thinking I should check with some other people to see if we had intended for this to be legal or illegal (or never thought of this case) before I commit to any specific behavior. I'll update this issue when I have a more concrete answer as to whether or not we think this should be allowed.

Also, I'm curious what your motivation is for trying this. I ask because knowing the use case often lets us figure out if this is really needed or if there are other ways of accomplishing the same thing.

pjaholkowski commented 6 years ago

Hi

Well I'm trying to copy image from swapchain from selected vulkan app to shared directx 11 texture with as less image copying as possible. In order to do that i need these extensions enabled:

"instance_extensions": [ { "name": "VK_KHR_get_physical_device_properties2", "spec_version": "2" }, { "name": "VK_KHR_external_memory_capabilities", "spec_version": "1" } ], "device_extensions": [ { "name": "VK_KHR_dedicated_allocation", "spec_version": "1" }, { "name": "VK_KHR_external_memory", "spec_version": "1" }, { "name": "VK_KHR_external_memory_win32", "spec_version": "1" }, { "name": "VK_KHR_win32_keyed_mutex", "spec_version": "1" }, { "name": "VK_KHR_get_memory_requirements2", "spec_version": "1" }, { "name": "VK_KHR_bind_memory2", "spec_version": "1" } ]

Actually not all of them it depends on external memory properties of shared texture. As you said " As far as I can tell, it isn't explicitly outlawed and the spec is pretty vague in this area" In my opinion disallowing layer to add specific extension to instance is illogical because there exist layers which already do that since a long time for example VK_LAYER_LUNARG_core_validation uses VK_EXT_debug_report and you don't need to enable VK_EXT_debug_report in app vkCreateInstance in order to use it (but it is explicit layer maybe here is a catch).

On second thought it is not loader fault that application crashes. It works on diffrent vendor apis because they fill

phys_dev_term->this_icd_term->dispatch.GetPhysicalDeviceProperties2KHR; phys_dev_term->this_icd_term->dispatch.GetPhysicalDeviceProperties2;

irrespectively of VK_KHR_get_physical_device_properties2 extension usage. The only thing that is weird: Why loader allows calling null pointer ?

if (fpGetPhysicalDeviceProperties2 != NULL || !inst->enabled_known_extensions.khr_get_physical_device_properties2) { // Pass the call to the driver fpGetPhysicalDeviceProperties2(phys_dev_term->phys_dev, pProperties); }

I had to fix that somehow without waiting for updates: So if I acquire (loader_instance*) from VkPhysicalDevice and set these flags inst->enabled_known_extensions.khr_get_physical_device_properties2 = 1; inst->enabled_known_extensions.khr_external_memory_capabilities = 1; everything works.

lenny-lunarg commented 6 years ago

VK_LAYER_LUNARG_core_validation uses VK_EXT_debug_report and you don't need to enable VK_EXT_debug_report in app vkCreateInstance in order to use it (but it is explicit layer maybe here is a catch).

Core validation implements VK_EXT_debug_marker, which is a different situation than enabling an extension that is implemented by the driver. There have also been bugs found in that implementation recently, so I wouldn't be surprised if core validation is doing this wrong.

On second thought it is not loader fault that application crashes.

It is the loader's fault. The problem is were it assigns the function pointer:

if (inst != NULL && inst->enabled_known_extensions.khr_get_physical_device_properties2) {
    fpGetPhysicalDeviceProperties2 = icd_term->dispatch.GetPhysicalDeviceProperties2KHR;
} else {
    fpGetPhysicalDeviceProperties2 = icd_term->dispatch.GetPhysicalDeviceProperties2;
}

Since the loader thinks the extension isn't enabled, it gets the wrong function pointer. That function pointer being null is required for a 1.0 driver and is perfectly legal for a 1.1 driver if the application requested a 1.0 instance. The fact that the loader calls into a null pointer is acceptable if the application (or layer) made an illegal Vulkan call. This is allowed because Vulkan generally does not do any error checking to make sure API calls are legal.

As far as figuring out if changing the enabled instances is legal, I can see some problems with allowing this:

While I do think there is some value to letting a layer enable an extension, my current thinking is that it would work far better if it was done through some other mechanism, separate from modifying parameters in vkCreateInstance.

ianelliottus commented 6 years ago

@lenny-lunarg just asked me about this. As one of the original people to work in this area, and as one still involved in the spec for this, I will state that I think it's illegal/undefined for a layer to unilaterally enable an extension. Here's some quick rationale (not taking the time to quote exact parts of the spec):

Layers are not allowed to extend/change Vulkan functionality. They can provide extensions, and those extensions are allowed to extend/change Vulkan functionality.

However, it's the application's responsibility to enable extensions.

Besides the spec aspects of this, layers are ill-positioned to enable an extension. There is not guarantee where a given layer will be in the layer stack. As a result, if it adds an extension to the list, some parts of Vulkan won't see the extension being enabled. This will result in inconsistent dispatch tables, with some parts of the system having a nullptr and others having a valid pointer for the same command. That means that eventually, nullptr's will eventually be de-referenced. I'm guessing that's what you're running into.

The debug extension is an interesting example. It's one place that we cheated on the rules, and took extra steps to handle it (since we were implementing the loader, layers, and the original/sample ICD).

Hope that helps, Ian Elliott

pjaholkowski commented 6 years ago

"While I do think there is some value to letting a layer enable an extension, my current thinking is that it would work far better if it was done through some other mechanism, separate from modifying parameters in vkCreateInstance."

At first I wanted to hook vkCreateInstance without need to create a layer but it is not ultimate solution because loader can be linked as static and I can't think of any uninvasive solution to find address of vkCreateInstance before it is called. Maybe something like additional pre_instance function in json 1.1.12 should be used. That way loader in vkCreateInstance trampoline would be able to reorder layers based on extension dependency. It would be also nice to notify each layer about all enabled extensions in VkInstanceCreateInfo sended to it's vkCreateInstance callback

"Besides the spec aspects of this, layers are ill-positioned to enable an extension. There is not guarantee where a given layer will be in the layer stack. As a result, if it adds an extension to the list, some parts of Vulkan won't see the extension being enabled. This will result in inconsistent dispatch tables, with some parts of the system having a nullptr and others having a valid pointer for the same command. That means that eventually, nullptr's will eventually be de-referenced. I'm guessing that's what you're running into."

What about letting instance to be created then destroy it and create new one from a layer by calling vkCreateInstance trampoline with added extensions and layers.

if(extensionsAndLayersIWantNotEnabled) { VkLayerInstanceDispatchTable pInstanceDispatch = ((VkLayerInstanceDispatchTable *)pInstance);

PFN_vkCreateInstance  pLoaderCreateInstance  = (PFN_vkCreateInstance)pInstanceDispatch->GetInstanceProcAddr(nullptr, "vkCreateInstance");

PFN_vkCreateInstance pNextCreateInstance = (PFN_vkCreateInstance)vkNextGetInstanceProcAddr(VK_NULL_HANDLE, "vkCreateInstance");

VkResult ret = pNextCreateInstance(&oldCreateInfo, pAllocator, pInstance);

{
    //Increase reference count because call to pLoaderDestroyInstance will unload layer module 
    const uint32_t layerPathSize = 1024;
    wchar_t layerPath[layerPathSize];

    HMODULE layerHandle = GetModuleHandleW(L"LayerName.dll");
    GetModuleFileNameW(layerHandle, layerPath, layerPathSize);

    layerHandle = LoadLibraryW(layerPath);
}

PFN_vkDestroyInstance pLoaderDestroyInstance = (PFN_vkDestroyInstance)pInstanceDispatch->GetInstanceProcAddr(*pInstance, "vkDestroyInstance");

pLoaderDestroyInstance(*pInstance, pAllocator);

VkResult result = pLoaderCreateInstance(&newCreateInfo, pAllocator, pInstance);
return result;

}

All problems you wrote about doesn't exist in that case. Am I right ?

airlied commented 5 years ago

I'm also hitting this problem trying to come up with a default device layer for Linux, since we really need to get more info from the driver to make a choice on what is to be default. Info like EXT_pci_bus_info provides, but if the app doesn't enable it then the layer cannot do it's job. I also just don't see many apps enabling that extension or some of the GetPhysicalDeviceProperties2 exts.

pjaholkowski commented 5 years ago

Do you implement your layer only for x86-x64 architecture ?

airlied commented 5 years ago

any layer for device selection will be for all Linux arches (so at least x86 64/32, ppc-le 64/32, arm 64/32).

pjaholkowski commented 5 years ago

I've got an idea how to make it work on Windows x86-64 which can be translated for Linux. However I don't know if it will work on ppc-le or arm probably not. I will have this implemented by end of the weekend. I can show you solution if you want.

lostgoat commented 4 years ago

@pjaholkowski @lenny-lunarg This functionality is fairly important for SteamVR. We have a set of early OpenVR Vulkan games that did not call into the OpenVR runtime to query for the required Vulkan Instance/Device extensions: https://github.com/ValveSoftware/openvr/wiki/Vulkan#extensions

Our workaround for such games has been to enable the missing extensions through a layer: https://github.com/lostgoat/VulkanTools/commit/7735db4fbc53f02841af47b9d8e8926d66db7207

Additionally, this might also be relevant for OpenXR. See this internal discussion: https://gitlab.khronos.org/openxr/openxr/-/merge_requests/1626#note_260403

pjaholkowski commented 4 years ago

Hi!

Some problems of this type are partially solved if the extensions you need are included as core extensions. Then in vkCreateInstance hook you need to change api version in VkApplicationInfo.apiVersion structure to version which included it as core. You can find example in https://github.com/obsproject/obs-studio/blob/master/plugins/win-capture/graphics-hook/vulkan-capture.c

lenny-lunarg commented 4 years ago

A couple of thoughts from my end:

Overall, my view at this point (which I realize is somewhat different from what I said in the past) is that I'm concerned that we're talking about adding new functionality which will just over complicate things without having much benefit. If we decide that it's absolutely needed, then we can do it, but I'm much more eager to look for alternative solutions.

nyorain commented 3 years ago

Is there an ongoing need to enable new instance extensions from a layer

I've just run into the same issue. There are many instance extensions that can be useful in a layer e.g. for querying additional information or manipulating surface creation (possibly requiring new extensions). IMO giving layers the possibility to do this is a good thing, layers using it poorly or in a way that conflicts with the vulkan spec is a separate issue and shouldn't be an argument against it.

nyorain commented 3 years ago

but I'm much more eager to look for alternative solutions

Does force-enabling instance extensions inside the loader based on an environment variable sound more acceptable? It's a hack as well but makes layers that just require certain extensions usable (e.g. it should allow using the pci functionality from mesa's device selection layer as well even though yet another env variable is needed) but as an explicit user opt-in. I hacked together a version where the loader just checks an environment variable "VK_LOADER_FORCE_INST_EXTS" and enables the passed extension list on instance creation, this approach works for me. If this sounds remotely acceptable or worth further discussion, I will happily submit a cleaned-up implementation as PR.

cubanismo commented 3 years ago

I wanted to make a quick note here: We ran into this issue as well in some internal projects, and for us, the specific need was to gain access to VK_KHR_get_physical_device_properties2 functionality from a layer, regardless of Vulkan core version of extensions enabled by the application.

lenny-lunarg commented 3 years ago

We talked over this a little in the SI conference today. The feeling was that the one use case where enabling an instance extension is needed is VK_KHR_physical_device_properties2. While I would have some interest in supporting that, it would be a fairly large change, we have a workaround that covers most cases (including newer loaders), and we have plenty of other things we need to do. As a result, it's just not high enough priority that I expect to get to this. I'm closing this issue because I don't expect to get to it. If this becomes higher priority we can always reopen this, but for now I'm closing this as a won't fix issue.