KhronosGroup / Vulkan-Loader

Vulkan Loader
https://vulkan.lunarg.com/doc/sdk/latest/linux/LoaderInterfaceArchitecture.html
Other
513 stars 280 forks source link

vkGetInstanceProcAddr returns non-NULL for unsupported functions #1443

Open baldurk opened 8 months ago

baldurk commented 8 months ago

Describe the bug The vulkan spec defines what vkGetInstanceProcAddr should return for different cases, but aside from a couple of special functions like global functions the main three cases that return non-NULL are:

  • core dispatchable command
  • enabled instance extension dispatchable command for instance
  • available device extension dispatchable command for instance

And it says "any other case, not covered above" must return NULL explicitly. There are some squirrelly details above especially regarding the recent maintenance5 change to whether core functions higher than the application requested can be returned and things, but I don't see any case where an entirely unsupported extension function would be allowed to return a function pointer.

When testing though, it looks like any extension that the loader is aware of at build time will return a function pointer here via trampoline_get_proc_addr -> extension_instance_gpa, when instead it should return NULL for extensions that are not available. For example on my desktop nvidia system I was able to get a function pointer for vkGetDynamicRenderingTilePropertiesQCOM which is not a core command, enabled instance extension command, or available device extension command.

I believe this is not the case for extensions that the loader is unaware of at build time, in that case it looks like it will only return a function pointer if something down the chain implements it.

I ran into this because it was causing some crashes when mesa goes to query for calibrateable timestamp functions. It queries first for the promoted vkGetPhysicalDeviceCalibrateableTimeDomainsKHR and then only if that is NULL it queries for vkGetPhysicalDeviceCalibrateableTimeDomainsEXT.

Previously due to RenderDoc having a similar bug this results in a crash where it calls through the KHR function, doesn't call through RenderDoc's layer to unwrap the physical device, and breaks after that because the physical device isn't as expected. I fixed RenderDoc's bug to properly return NULL for the KHR function, but that means it just crashes from calling directly to NULL from the trampoline. I've also now worked around this issue by implementing the KHR extension in RenderDoc, but I believe this will still crash if code like that (which is dubiously valid according to the spec) is run directly on a driver that supports the EXT but not the KHR.

Environment (please complete the following information):

To Reproduce Steps to reproduce the behavior:

  1. Create a vulkan 1.3 instance with no extensions or only WSI extensions.
  2. Enumerate only an nvidia physical device that does not support VK_QCOM_tile_properties, or substitute for any clearly unsupported extension.
  3. Call vkGetInstanceProcAddr(instance, "vkGetDynamicRenderingTilePropertiesQCOM")
  4. See that instead of NULL returned, we get a function pointer. It's not valid to call but it should not have been returned.

VK_LOADER_DEBUG output vk_loader_debug.txt

Additional context To get the proper Get*ProcAddr behaviour from RenderDoc you'll need to build from latest v1.x branch as I've just pushed the fix to return NULL from GIPA/GDPA for unknown functions. The bug is independent of that though.

charles-lunarg commented 8 months ago

This is a duplicate of #306, which appears to have been created by a user after discussion with you.

The crux of the problem is that applications can query any and all device functions with vkGetInstanceProcAddr before any VkPhysicalDevices have been enumerated or VkDevices created making it impossible to check if the extension is supported at that time. While the loader could enumerate physical devices and get the available device extensions for each, there is no guarantee the application would enable the function on a vkdevice, leading to another edge case.

That said, I have to agree that the loader is in the wrong here, I am just unsure of what can be done about it. Very open to suggestions.

baldurk commented 8 months ago

I wasn't aware of the previous issue sorry, I searched for vkGetInstanceProcAddr in the issues list but managed to miss that one somehow.

This might be a problem people will run into more commonly now as I suspect that kind of "if KHR, use KHR, else use EXT" pattern will be one other users will use with the recent promotions of a few EXTs to KHR which I don't think had happened so directly before.

To clarify, calling GIPA on a function does not guarantee that you'll later be able to use the function pointer necessarily. If as you say you GIPA a device extension function pointer and then don't enable that extension, you can't use it. However if you GIPA and get a function pointer it should at least be usable in some circumstance (only on the physical device(s) that support it, and only if enabled). The same applies to e.g. GIPA for a vulkan 1.3 function which then can't be used on a vulkan 1.2 physical device. It's an edge case, but it is well defined by the spec.

I did notice that there was code in there for handling instance extensions correctly - so a non-enabled but available instance extension would return NULL. That is a different case, but if I were solving this my first approach would be to extend that check for 'enabled instance extensions' that's there to also look for 'available device extensions' as those are the two subsets we care about, which would happen at instance creation time and not for every call to vkGetInstanceProcAddr.

charles-lunarg commented 8 months ago

Ah, so the solution of querying the available device extensions and checking that list is sufficient. That's a relief as it is tractable to implement.

I am unsure what priority I can give the issue due to it being one that has been around for a long time, but agree that it may be seen more as people move from EXT to KHR extension function pointers.

charles-lunarg commented 8 months ago

Also, this strikes me as an issue with the specification. The loader cannot follow the spec 100% becausevkGetInstanceProcAddr is defined to 'work' for device functions but the loader has no authoritative way to determine if a function is valid until after a device is created. Its a bit of a catch 22 and the current behavior is our best effort.

The reality is that this issue has existed in the loader for a number of years and so its not critical to have it be resolved ASAP (ie, this isn't a regression). Due to that, I cannot work on the above solution immediately.

baldurk commented 8 months ago

I'm not quite sure I understand what you mean, aside from the question of prioritisation. The trampolines that the loader returns today before a device is created should be fine, working on dispatch tables after the device has been created for device functions, it's just a question of when those trampolines are returned.

As in the discussion above, the functions which are returned do not have to work if the device they're called on would not allow them to be legally used. The loader doesn't have to predict the future and only return functions that are valid, that would definitely be a catch 22, it can return potentially invalid functions so long as it doesn't return functions which definitely can't ever be valid for the set of physical devices. The distinction being a function pointer can be returned if its extension is available on any device, which is a check that can be done before any device is created. The responsibility is on the application to not use it if the extension is not then subsequently enabled on the device that's created. If it's not even available in the first place then it should return NULL.

At instance creation time checking the union of all device extensions from the physical devices and noting that in the same way that it tracks enabled instance extensions seems to me like it would be following the spec 100%, though maybe there's an issue I'm missing there?

charles-lunarg commented 8 months ago

Apologies, I was going off of memory when thinking about what the specification says.

My thought process was that the implementation would, during vkCreateInstance and after returning up the layer call chain, enumerate physical devices and device extensions then fill out a struct of 'available device extensions'. This struct caches the device extension list so it doesn't have to be queried on every call to vkGetInstanceProcAddr made by the application. My concern was that subsequently after vkCreateInstance returns to the application, layers or even drivers may change the device extension list that is available (which doesn't make a ton of sense, but is theoretically possible) making the loader struct of available extensions out of date. The case of the cache saying an extension is valid when it actually isn't could cause crashes, while the case of the cache saying an extension isn't present would cause apps to be unable to functions they otherwise should be able to. I'm not sure this line of reasoning holds up, but was what prompted me to respond. Ultimately this lead to questioning if there needs to be spec language.

Side note - caching the list of available device extensions is problematic if the VkPhysicalDevices is dynamic, ie hot-swapping of GPU's is explicitly supported by the API. The current loader model of 'just returning device functions no matter what' supports hot swapping at the cost of users needing to be extra sure to not call invalid but not null function pointers.

baldurk commented 8 months ago

Yeh I'm not sure that the spec actually calls out that extension lists should be invariant (at least invariant given the inputs of a particular instance's creation parameters), I wonder if that's thought to be implied? It feels like if some system did do that then you're starting to strain the definition of 'available' in this context and any behaviour like function pointers for extensions that disappear, or missing function pointers for extensions that appear, would be more a driver/layer bug from being inconsistent than a loader bug at that point. An application would still be limited by the set of extensions it can successfully enable at device creation time.

It is true that caching the extensions in an instance doesn't hold up if the list of physical devices change, but it seems to me like other things would break there too. What if the list of physical devices changes between the application enumerating them and creating a device? Since the lifetime of physical device handles is tied to the lifetime of an instance I'd also expect that list to be invariant, otherwise do you end up with invalid handles that are unspecified? I'd naively expect you to have to create a new instance to refresh the list of physical devices or get 'up to date' info, but that said I'm definitely not an authority on whether hotplug is intended to be supported (or how it would be spec'd since I don't think there's any spec language covering that today?)

I think with hindsight probably a lot of people would have preferred an explicit GetProcAddr for instance, physical device, and device level functionality that only works for each level, since that would ameliorate almost all of these problems. But we have what we have :).

zmike commented 8 months ago

This appears to be a larger scale issue. I've started getting tickets about it in mesa https://gitlab.freedesktop.org/mesa/mesa/-/issues/10624

Edu4rdSHL commented 8 months ago

Hello, this issue is currently impacting all Linux users who have updated to Mesa 24 which bumped the vulkan headers via https://gitlab.freedesktop.org/mesa/mesa/-/commit/c9e1758462caf342ab9f0a0acc3b392c5b2354d8 and therefore introduced the bug.

More details: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10624

charles-lunarg commented 8 months ago

@zmike I'm sorry, but how is this loader bug the cause of the issue?

I see that its related to updating the headers to 1.3.273, which would cause some functions that were returning NULL to now return non-null, but that is all that should change. Applications which don't check if they can call the function pointer are going to call the loader's trampoline and subsequently crash on NULL deference.

That means that it is possible for the application to 'check' that any given device function is valid before calling. Alternatively, don't use vkGetInstanceProcAddr to get device functions, use vkGetDeviceProcAddr. It'll give you NULL if its unsupported (or well it should) and won't have the loader trampoline so its also faster.

I understand that this loader issue potentially causes difficulty, but this loader bug is NOT new (see #306) so I'm entirely surprised that this is the first time it is such a big issue.

zmike commented 8 months ago

I think this is likely the first time there's been a wave of commonly used KHR promotions that everyone is using? In either case, I'll double check my usage.

charles-lunarg commented 8 months ago

EXT to KHR promotion should not change how something is effected by this bug AFAIK.

If a function is 'known' to the loader, that is when this bug comes into effect, as the loader blindly returns trampolines to known device functions from vkGetInstanceProcAddr.

Ironically, unknown device functions are better handled here.

vorporeal commented 8 months ago

Hey there, I'm a maintainer of Warp (terminal emulator) - we're getting a number of user reports of crashes due to the linked mesa issue. (See the mention that was added above.)

@zmike @charles-lunarg I'm primarily interested in mitigating the user impact, and personally have little knowledge about mesa and Vulkan-Loader, so I'm having trouble figuring this out for myself:

Is there something that can be done on our end to mitigate these crashes? Is the issue with user code? mesa code? Vulkan-Loader code?

zmike commented 8 months ago

EXT to KHR promotion should not change how something is effected by this bug AFAIK.

If a function is 'known' to the loader, that is when this bug comes into effect, as the loader blindly returns trampolines to known device functions from vkGetInstanceProcAddr.

Ironically, unknown device functions are better handled here.

False alarm on my end, I think this is just a very annoying codegen bug in mesa.

scygan commented 5 months ago

This issue is currently causing failures in VK-GL-CTS. It looks like the test framework is loading vkGetPhysicalDeviceCalibrateableTimeDomainsEXT (and ..KHR) with vkGetInstanceProcAddr , using following logic:

https://github.com/KhronosGroup/VK-GL-CTS/blob/aefd5c09cacdb5a3320da717af6ec646d96c2379/external/vulkancts/framework/vulkan/generated/vulkansc/vkInitInstanceFunctionPointers.inl#L82

if (!m_vk.getPhysicalDeviceCalibrateableTimeDomainsKHR)
    m_vk.getPhysicalDeviceCalibrateableTimeDomainsKHR = (GetPhysicalDeviceCalibrateableTimeDomainsKHRFunc) GET_PROC_ADDR("vkGetPhysicalDeviceCalibrateableTimeDomainsEXT");

This looks in-line with the Vulkan specification: image image

It seems the behavior where loader returns not-NULL for functions not supported by any physical device is against the spec and should be changed.

On platforms, which do support VK_EXT_calibrated_timestamps but not VK_KHR_calibrated_timestamps this causes a VK-CTS failure (KHR function is called in the loader, without a way to call proper function in the driver).

charles-lunarg commented 4 months ago

My instinct for this issue is to implement the solution described above: At the end of instance creation, enumerate physical devices and their device extensions & cache it. While some layers are known to remove physical devices, this is not an issue as it 'was available' at some point. The specification doesn't have much language to describe hot-swapping or similar, where a physical device may become available at a point later than vkCreateInstance. The recent loader change which unloads drivers if they contain no physical devices was predicated upon the working group deciding that proper support for hot-swapping would require more changes to the spec to properly handle everything. This means that the loader can generally assume that the list of physical devices & its device extensions are stable, at least for the purposes of this issue.

I will work on creating a PR to address this in the coming days. The logic exists for instance extensions, so its only a matter of altering it to work with device extensions.

charles-lunarg commented 4 months ago

I have created a rough PR with the solution. Not all tests pass and more work is needed for naming/organization but is mostly there.

https://github.com/charles-lunarg/Vulkan-Loader/tree/implement_available_device_extensions_for_vkGetInstanceProcAddr

But after bringing up this issue to the SI working group, it is apparent that a solution which complicates matters in the long run isn't practical. The WG would even be interested in altering the spec to reflect the current behavior (for backwards compat).

jjulianoatnv commented 3 months ago

Why is this from the issue description:

See that instead of NULL returned, we get a function pointer. It's not valid to call but it should not have been returned.

considered to be a bug?

Why do you think the loader should be disallowed from returning non-NULL? Why can't the trampoline function call return a "not supported" VkError if the trampoline function is called with input parameters that the trampoline cannot dispatch to an ICD?

ianelliottus commented 3 months ago

After the 10 July, 2024 SI TSG meeting, I have a simple proposal: 1) "fix the loader" to only return a trampoline if the function is "supported"; 2) add text to the spec (normative text and/or a note) to articulate what an app must/should not do (e.g. ask for and use a trampoline for a non-supported function).

jjulianoatnv commented 3 months ago

The refpage for vkGetInstanceProcAddr has a footnote 4 that defines an available device extension:

An “available device extension” is a device extension supported by any physical device enumerated by instance.

The definition builds upon a concept of a supported device extension.

Where is the definition of a supported device extension? I'm not finding a definition. If this concept is not clearly defined, then neither is the concept of an available device extension.

cubanismo commented 3 months ago

After the 10 July, 2024 SI TSG meeting, I have a simple proposal: 1) "fix the loader" to only return a trampoline if the function is "supported"

I'm pretty strongly opposed to this. I think this is effectively useless behavior and the loader is right, the spec is wrong. Given this is how the loader has always behaved in practice, we should just fix the spec and stop asking the loader to filter trampolines when they aren't guaranteed to work anyway. The fact there is some filtering just misleads applications IMHO. Just specify that functions returned by GetInstanceProcAddr might not work, which is already true with or without any spec change. Any app relying on this filtering is effectively broken already on at least some configurations.