devshgraphicsprogramming opened this issue 12 months ago (status: Open)
Removed `KHR_variable_pointers` from the title and as an analogy, because I forgot, AGAIN, that they cannot be stored anywhere and can't point to the Private storage class.
Thank you for your suggestion! The Vulkan team very much values your feedback. We're collecting suggestions now and will review them in the Vulkan working group shortly.
The main desirable convenience here is higher-order functions, right? Otherwise it is esthetics at best, and I am not even sure that fptrs are more esthetical than `switch`. And higher-order functions may cause the code to be less transparent, especially wrt non-uniformity of the branching.

IDK, I would still find this somewhat prettier and more transparent:

```glsl
void p(){
    if( cond ) doA();
    else doB();
}
void main(){
    p();
}
```

What this compiles into could be made to depend on whether `cond` is constexpr, subgroup uniform, or dynamic.

There's a bit of a push right now that if a feature exists in C/C++ then it must be in GPU APIs. But the origin of C is a bit different, and I am not sure the abstractions always match well to what GPU/SIMD is. E.g. even a simple `if` is a bit more devious in GLSL than it is in C.
A switch will rarely, if ever, get compiled to a jump table; the best you can hope for is an if-else chain, which means that instead of an O(1) overhead on a "dynamic function call" you incur either O(log2(labelCount)) for a flattened binary search tree or O(labelCount) for a simple if-else chain.

This means you end up paying an instruction count/performance (not occupancy or size) overhead for code that you don't use.
It may compile to nothing if the conditional is constexpr. In that case `if` or `switch` is basically a glorified preprocessor.

Which seems almost what you want here, considering the proposed restriction that the fptr needs to be uniform.
As @Hugobros3 will be happy to inform you, the compiler can perform the same analysis on a function pointer and inline function pointer calls if the pointer is known to be constant.
Yes, but then it is just esthetics if both can do the same thing, right?
No, because if I make 512 functions with the same signature and make a `switch` dispatcher, in the best case I'm likely to have to pay for 9 branches, convergence/reconvergence checks and masking. Worst case I'll be paying for 512 `else if` conditional evaluations.

I am not aware of any compiler that will actually codegen a `switch` as a jump table. (Also, even if it did, there would be a bunch of restrictions on when it can actually do that, like label values and whether you fall through.)

Furthermore you can't recurse with a `switch` or any similar dispatcher, because SPIR-V requires structured control flow, so you can't even make your own stack :(
Actually I advocate for "real" function calls, other APIs have had them for years and they'd be a massive boon for Vulkan. True calls have advantages for generality (not having to know what will be called, expressing recursive algorithms naturally) and code size/quality (not having to inline every potential callee and explosively growing the module size).
They'd be useful even if there are restrictions wrt uniformity, or if only tail-calls are allowed; but if you must know what you're calling, then @krOoze is right and this is just (misleading!) syntactic sugar.
If you look at the new work graph stuff, we're slowly getting there, just in a roundabout way. I had a half-written proposal somewhere for SPIR-V, but the biggest problem will always be convincing the vendors to support it, and for that they want use-cases. Which is always a chicken-and-egg problem, because shading languages don't expose new features first either.
"real" function calls
I assume everyone would like to have that for convenience/generality, and providing a use case would be as simple as providing anything done in the "other APIs", or random C++ code for that matter. The question is whether that is the correct™ fitting abstraction, not just a convenient one (on a SIMD-like architecture). I assume you covered that point in your blog? Naively thinking, yea, a GPU can do it, but at like 1/64th of efficiency.
"real" function calls
I assume everyone would like to have that for convenience, and providing an usecase would be as simple as providing anything done in the "other APIs" or random C++ code for that matter. The question is whether that is the correct™ fitting abstraction, not just convenient one (on SIMD-like architecture). I assume you covered that point in your blog?
The hardware of multiple vendors can already do it (a subgroup uniform jump); the "correct abstraction" discussion is settled, it's what the HW is capable of doing and SPIR-V should expose it.

You can obviously have the discussion of whether SPIR-V should do a major breaking change and allow unstructured control flow OR function pointers. Given that most compilers tend to be written on top of LLVM, or be woefully inadequate, or saddled with so much tech debt that they can't innovate in a meaningful way (looking at GLSL compilers and some HLSL compilers here), these are the only two choices you really have at the IR level.

P.S. The reason I'm asking for function pointers is that SPIR-V decided on Structured Control Flow early in its development, and I doubt you can introduce jumps (even uniform ones) and labels this far down the road without blowing up everything which relied on banning unstructured control flow.
> The question is whether that is the correct™ fitting abstraction, not just convenient one (on SIMD-like architecture). I assume you covered that point in your blog? Naively thinking, yea, GPU can do it, but at like 1/64th of efficiency.
I'd like you to read my post if you can spare the time, but in a nutshell, you're conflating the idea of jumping/calling somewhere with diverging. This proposal's requirements enforce uniformity, so the only cost would be stashing away and recovering data on a stack of some sort, and it would not slow down operations in the callee.

Besides that, it's a silly argument to say that non-uniform calls would cause slowdowns, because the alternatives to them are big if/else trees or switch statements emulating the same functionality. Calls and function pointers are useful because they allow creating higher-order functions and data structures, which enable better abstractions.
Besides that still, the "calls" found in DX12 work graphs and `VK_AMDX_shader_enqueue` effectively implement invocation repacking, so they cost far less, possibly zero divergence. These calls are one-way, but you could implement returns by doing CPS (continuation-passing style) transformations in a clever compiler.
Look, on a hardware level, all of the following:

- `if`/loop/`switch` control flow
- un-inlined function calls
- function pointer calls

are implemented as jumps/gotos in the ISA; the only difference in the latter two is that the jump destination (return) address at the end of the block of code comes dynamically from a register and isn't a constant. The only difference between an un-inlined function and a fptr call is that the jump address to enter the routine is also not known in advance.
Btw an optimized (jump table) switch has the reverse behaviour: the address to jump to is not known, but the return address is constant.
Either way, a TAIL function pointer call has literally no overhead or difference compared to an if/loop/switch, because it's effectively the same thing. Hint: if you're worried about divergent function calls executing at 1/64 the speed (or, when you stop caring about GCN, 1/32), then a similarly divergent `if` or `switch` will also execute at 1/64 the speed.
Finally, even if you call different function pointers in each SIMD lane, your return jump address is guaranteed to be dynamically uniform across the active lanes! (Your active callees will reconverge.)

This is why it makes more sense to ask for function pointer calls in SPIR-V than unstructured control flow: if you start allowing random gotos, you have no guarantees on the invocation coming back to the call site and reconverging.
added an edit about live variable analysis.
Has this discussion moved anywhere (even outside of this issue)? I am also quite interested in having some form of function pointers passed through Vulkan. There are some limitations to pointers when used in CUDA (And I assume other compute APIs). It's also not possible in OpenCL (AFAIK).
If Vulkan can somehow deal with pointers in a cleaner way, then there's a good reason for certain workloads to use it instead of traditional compute APIs.
I believe we may see some extension from the Mesa side, as someone's hobby project.
Problem statement:
As we all know, many major GPU architectures have been able to perform actual function calls for a while now, and GPUs which support `KHR_raytracing_pipeline` usually have this feature, as it's much more efficient to perform a function call/jump based on an address stored in an SBT than to do fully inlined megakernels with `switch` or `if-else` chains/trees to branch into the correct function call.

Now obviously this would have to be subject to certain restrictions, such as recursion depth, therefore I'd like to propose function pointers but with certain limitations:
SPIR-V already has a somewhat nice `SPV_INTEL_function_pointers` extension, but it's not for the Vulkan environment.

Vulkan (and by extension OpenCL via clspv) would benefit a lot from this being available, as Metal and CUDA both have the feature.
EDIT 1: SPIR-V would probably benefit from explicit `OpSpill` and `OpRestore` for Variables the compiler determines to be live across the function pointer call site, such that the implementation's SPIR-V to ISA compiler doesn't need to perform it (it could, just to validate or optimize). This could benefit Raytracing Callables and Workgraphs too, especially the latter as it's still an `AMDX` extension, and I feel like live variable analysis could go a long way towards making a nice `KHR` or `EXT` version.

EDIT 2: Could `OpLifetimeStart` and `OpLifetimeEnd` be used for the same purpose if allowed in the `Shader` environment and not only `Kernel`?

Use Case Example(s):
Something like this in GLSL
If we support non-uniform function calls then
(Optional) Suggested Solution(s) (via opening an MR on vulkan-docs repo and creating a Proposal Document) :
Allow functions in SPIR-V subject to the restrictions outlined above to be able to have their addresses taken and stored in a Pointer Function Storage class or a new Function Pointer type.
I can see 3 fields in the extension properties struct:
`return` or as the last instruction in a void function

If this issue gets any traction I'll open the PR so CLA is signed.
An interesting read: https://xol.io/blah/gpus-function-calls/