[Roadmap Feedback] Function Pointers with some limitations

devshgraphicsprogramming commented 12 months ago

Problem statement:

As we all know, many major GPU architectures have been able to perform actual function calls for a while now, and GPUs that support KHR_raytracing_pipeline usually have this feature, as it's much more efficient to perform a function call/jump based on an address stored in an SBT than to build fully inlined megakernels with switch or if-else chains/trees to branch into the correct function call.

Now obviously this would have to be subject to certain restrictions, such as recursion depth, so I'd like to propose function pointers with certain limitations:

function pointer used for a call has to be subgroup uniform (for now)
shader declares the maximum stack size required (and thus the maximum recursion depth); if it does not, we default to depth=1 (only one call from the entry point) and a stack just large enough to call the "largest" of all functions reachable by pointer
pointers of different function signatures cannot be cast to each other (similar limitations exist in WASM/Mono-WASM as well as strict C etc.; this would enable easy "compatibility" driver implementations using switches or, better, if-else binary search trees. Metal is actually an example here of a polyfill with a table+dispatcher)
no pointer arithmetic on the function pointers, only assignments
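
To illustrate why the "no casting between signatures" restriction makes a compatibility implementation easy, here is a minimal CPU-side sketch in C (not SPIR-V or GLSL). All names (`UnaryFn`, `call_native`, `call_polyfill`, the `FN_*` indices) are made up for illustration. The assumption is that a driver without native indirect calls could represent each function pointer as an index into a per-signature table and lower the call to a dispatcher.

```c
#include <assert.h>

/* Hypothetical signature class: every "pointer" of this signature is just an
   index into one table, so it can never alias a function of another signature. */
typedef int (*UnaryFn)(int);

int doA(int x) { return x + 1; }
int doB(int x) { return x * 2; }

enum { FN_DO_A = 0, FN_DO_B = 1, FN_COUNT = 2 };

/* Native path: a single indirect call through the table. */
static const UnaryFn kTable[FN_COUNT] = { doA, doB };

int call_native(int fn, int arg) {
    assert(fn >= 0 && fn < FN_COUNT);
    return kTable[fn](arg);
}

/* Compatibility path: the same call lowered to a switch dispatcher,
   which only needs direct (inlineable) calls. */
int call_polyfill(int fn, int arg) {
    assert(fn >= 0 && fn < FN_COUNT);
    switch (fn) {
    case FN_DO_A: return doA(arg);
    default:      return doB(arg);
    }
}
```

Both paths return the same result for every valid index, so a shader author would not need to know which one the driver picked.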

SPIR-V already has a somewhat nice SPV_INTEL_function_pointers extension, but it's not for the Vulkan environment.

Vulkan (and by extension OpenCL via clspv) would benefit a lot from this being available, as Metal and CUDA both have the feature.

EDIT 1: SPIR-V would probably benefit from explicit OpSpill and OpRestore for Variables the compiler determines to be live across the function pointer call site, such that the implementation's SPIR-V to ISA compiler doesn't need to perform that analysis (it could, just to validate or optimize). This could benefit Raytracing Callables and Workgraphs too, especially the latter, as it's still an AMDX extension and I feel like live variable analysis could go a long way towards making a nice KHR or EXT version.

EDIT 2: Could OpLifetimeStart and OpLifetimeEnd be used for the same purpose if allowed in Shader environment and not only Kernel?

Use Case Example(s):

Something like this in GLSL

void doA();
void doB();

...

void(*)(void) p = cond ? doA:doB;
p();

If we support non-uniform function calls then

nonuniformEXT(p)();

(Optional) Suggested Solution(s) (via opening an MR on vulkan-docs repo and creating a Proposal Document) :

Allow functions in SPIR-V subject to the restrictions outlined above to be able to have their addresses taken and stored in a Pointer Function Storage class or a new Function Pointer type.

I can see 3 fields in the extension properties struct:

maxStackSize, 0 would mean you can only do tail-calls, basically as part of return or as the last instruction in a void function
returnAddressSize, this tells you how many bytes are needed just in order to be able to non-tail recurse without any arguments
[optional] nonUniformFunctionPointerCallNative, would tell you whether the device has an Independent Program Counter and doesn't need to waterfall strip your calls
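
As a rough sketch of how these three fields might look and be used, the C snippet below declares a hypothetical properties struct and derives a recursion budget from it. The struct name, the helper `max_non_tail_depth`, and the cost model (each non-tail call consumes `returnAddressSize` plus the callee's frame size) are assumptions for illustration, not part of any existing Vulkan header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct; only the three field names come from the proposal. */
typedef struct HypotheticalFunctionPointerProperties {
    uint32_t maxStackSize;      /* bytes; 0 means only tail calls are possible */
    uint32_t returnAddressSize; /* bytes needed per non-tail call, without arguments */
    uint32_t nonUniformFunctionPointerCallNative; /* boolean: 1 if no waterfall is needed */
} HypotheticalFunctionPointerProperties;

/* Assumed cost model: every non-tail call costs returnAddressSize + frameBytes. */
uint32_t max_non_tail_depth(const HypotheticalFunctionPointerProperties *props,
                            uint32_t frameBytes) {
    assert(props != NULL);
    if (props->maxStackSize == 0) {
        return 0; /* tail calls only */
    }
    uint64_t perCall = (uint64_t)props->returnAddressSize + frameBytes;
    if (perCall == 0) {
        return UINT32_MAX; /* calls cost nothing under this model */
    }
    return (uint32_t)(props->maxStackSize / perCall);
}
```

For example, with a 1024-byte stack, 8-byte return addresses, and 24 bytes of live state per call, this model allows 32 nested non-tail calls.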

If this issue gets any traction I'll open the PR so CLA is signed.

An interesting read: https://xol.io/blah/gpus-function-calls/

devshgraphicsprogramming commented 12 months ago

Removed KHR_variable_pointers from the title and as an analogy, because I forgot, AGAIN, that they cannot be stored anywhere and can't point to the Private storage class.

marty-johnson59 commented 11 months ago

Thank you for your suggestion! The Vulkan team very much values your feedback. We're collecting suggestions now and will review them in the Vulkan working group shortly.

krOoze commented 11 months ago

The main desirable convenience thing here is to have higher-order functions, right? Otherwise it is esthetics at best, and I am not even sure that fptr are more esthetical than switch. And higher-order functions may cause the code to be less transparent; especially wrt non-uniformity of the branching.

IDK, I would still find this somewhat prettier and transparent:

void p(){
    if( cond ) doA();
    else doB();
}

void main(){
    p();
}

What this compiles into could be made to depend on whether cond is constexpr, subgroup uniform, or dynamic.

There's a bit of a push right now that if a feature exists in C/C++ then it must be in GPU APIs. But the origin of C is a bit different, and I'm not sure the abstractions always match well to what GPU\SIMD is. E.g. even a simple if is a bit more devious in GLSL than it is in C.

devshgraphicsprogramming commented 11 months ago

> The main desirable convenience thing here is to have higher-order functions, right? Otherwise it is esthetics at best, and I am not even sure that fptr are more esthetical than switch. And higher-order functions may cause the code to be less transparent; especially wrt non-uniformity of the branching.

> IDK, I would still find this somewhat prettier:
> void p(){
>   if( cond ) doA();
>   else doB();
> }
>
> void main(){
>   p();
> }
> What this compiles into may depend on whether cond is constexpr, subgroup uniform, or dynamic.

> There's a bit of a push right now that if a feature exists in C/C++ then it must be in GPU APIs. But the origin of C is a bit different, and I'm not sure the abstractions always match well to what GPU\SIMD is. E.g. even a simple if is a bit more devious in GLSL than it is in C.

A switch will rarely, if ever, get compiled to a jump table. The best you can hope for is a flattened binary search tree of if-else branches, so instead of an O(1) overhead on a "dynamic function call" you incur O(log2(labelCount)), or O(labelCount) in the case of a simple if-else chain.

This means you end up paying instruction count/performance (not occupancy or size) overhead for code that you don't use.

krOoze commented 11 months ago

It may compile to nothing if the conditional is constexpr. In that case if or switch is basically a glorified preprocessor.

Which seems almost what you want here, considering the proposed restriction that the fptr needs to be uniform.

devshgraphicsprogramming commented 11 months ago

> It may compile to nothing if the conditional is constexpr. In that case if or switch is basically a glorified preprocessor.

as @Hugobros3 will be happy to inform you, the compiler can perform the same analysis on a function pointer and inline function pointer calls if the pointer is known to be constant.

krOoze commented 11 months ago

Yes, but then it is just esthetics if both can do the same thing, right?

devshgraphicsprogramming commented 11 months ago

> Yes, but then it is just esthetics if both can do the same thing, right?

No, because if I make 512 functions with the same signature and a switch dispatcher, in the best case I'm likely to pay for 9 branches (log2(512) = 9), plus convergence/reconvergence checks and masking.

Worst case I'll be paying for 512 else if conditional evaluations.

I am not aware of any compiler that will actually codegen a switch as a jump table. (Also, even if it did, there would be a bunch of restrictions on when it can actually do that, such as the label values and whether you fall through.)
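
The 9-versus-512 figures can be checked with a small C sketch that just counts condition evaluations. The helper names `chain_comparisons` and `tree_comparisons` are made up, and the model is deliberately simplified: it assumes labels 0..count-1 and ignores convergence checks and masking.

```c
#include <assert.h>

/* Conditions evaluated by a plain if / else-if chain that tests
   labels 0, 1, 2, ... in order until it finds `target`. */
unsigned chain_comparisons(unsigned target, unsigned count) {
    assert(target < count);
    unsigned comparisons = 0;
    for (unsigned label = 0; label < count; ++label) {
        ++comparisons;
        if (label == target) {
            break;
        }
    }
    return comparisons;
}

/* Conditions evaluated by a balanced binary search tree of if/else
   branches over the same labels: one comparison per halving. */
unsigned tree_comparisons(unsigned target, unsigned count) {
    assert(target < count);
    unsigned lo = 0, hi = count, comparisons = 0;
    while (hi - lo > 1) {
        unsigned mid = lo + (hi - lo) / 2;
        ++comparisons;
        if (target < mid) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return comparisons;
}
```

With 512 labels the tree costs 9 comparisons for every target, while the chain costs up to 512; an indirect call through a pointer would cost neither.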

devshgraphicsprogramming commented 11 months ago

Furthermore you can't recurse with a switch or any similar dispatcher because SPIR-V requires structured control flow, so you can't even make your own stack :(

Hugobros3 commented 11 months ago

Actually I advocate for "real" function calls, other APIs have had them for years and they'd be a massive boon for Vulkan. True calls have advantages for generality (not having to know what will be called, expressing recursive algorithms naturally) and code size/quality (not having to inline every potential callee and explosively growing the module size).

They'd be useful even if there were restrictions wrt uniformity, or if only tail-calls were allowed, but if you must know what you're calling, then @krOoze is right and this is just (misleading!) syntactic sugar.

If you look at the new work graph stuff, we're slowly getting there, just in a roundabout way. I had a half-written proposal somewhere for SPIR-V, but the biggest problem will always be convincing the vendors to support it, and for that they want use-cases. Which is always a chicken-and-egg problem, because shading languages don't expose new features first either.

krOoze commented 11 months ago

"real" function calls

I assume everyone would like to have that for convenience\generality, and providing a use case would be as simple as providing anything done in the "other APIs" or random C++ code for that matter. The question is whether that is the correct™ fitting abstraction, not just a convenient one (on a SIMD-like architecture). I assume you covered that point in your blog? Naively thinking, yea, a GPU can do it, but at like 1/64th of the efficiency.

devshgraphicsprogramming commented 11 months ago

"real" function calls

> I assume everyone would like to have that for convenience, and providing a use case would be as simple as providing anything done in the "other APIs" or random C++ code for that matter. The question is whether that is the correct™ fitting abstraction, not just a convenient one (on a SIMD-like architecture). I assume you covered that point in your blog?

The hardware of multiple vendors can already do it (a subgroup uniform jump), so the "correct abstraction" discussion is settled: it's what the HW is capable of doing, and SPIR-V should expose it.

You can obviously have the discussion whether SPIR-V should make a major breaking change and allow unstructured control flow OR function pointers. Given that most compilers tend to be written on top of LLVM, or are woefully inadequate, or are saddled with so much tech debt that they can't innovate in a meaningful way (looking at GLSL compilers and some HLSL compilers here), these are the only two choices you really have at the IR level.

P.S. The reason I'm asking for function pointers is that SPIR-V decided on Structured Control Flow early in its development, and I doubt you can introduce jumps (even uniform ones) and labels this far down the road without blowing up everything that relied on unstructured control flow being banned.

Hugobros3 commented 11 months ago

> The question is whether that is the correct™ fitting abstraction, not just convenient one (on SIMD-like architecture). I assume you covered that point in your blog? Naively thinking, yea, GPU can do it, but at like 1/64th of efficiency.

I'd like you to read my post if you can spare the time, but in a nutshell, you're conflating the idea of jumping/calling somewhere with diverging. This proposal's requirements enforce uniformity, so the only cost would be stashing away and recovering data on a stack of some sort, and it would not slow down operations in the callee.

Besides that, it's a silly argument to say that non-uniform calls would cause slowdowns, because the alternatives to them are big if/else trees or switch statements that emulate the same functionality. Calls and function pointers are useful because they allow creating higher-order functions and data structures, which enable better abstractions.

Besides that still, the "calls" found in DX12 work graphs and VK_AMDX_shader_queue effectively implement invocation repacking, so they cost far less, possibly zero divergence. These calls are one-way, but you could implement returns by doing CPS (continuation-passing style) transformations in a clever compiler.
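
For readers unfamiliar with the CPS idea, here is a minimal C sketch of how a "return" can be rebuilt from one-way calls. It is a CPU analogy only: the `Continuation` struct and the function names are invented for this example, and on a GPU the one-way call would be something like a work graph node launch rather than a C function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* A continuation packages "the code after the call site" plus the state it needs. */
typedef struct Continuation {
    void (*fn)(struct Continuation *self, int value);
    int *result; /* where the final value ends up */
} Continuation;

/* CPS callee: it never returns a value to its caller. It "returns" by making
   a one-way (tail) call to the continuation it was handed. */
void square_cps(int x, Continuation *k) {
    assert(k != NULL && k->fn != NULL);
    k->fn(k, x * x);
}

/* What used to follow the call in direct style: result = square(x) + 1. */
void add_one_then_store(Continuation *self, int value) {
    *self->result = value + 1;
}

/* Direct-style wrapper so the sketch can be used like a normal function. */
int square_plus_one(int x) {
    int result = 0;
    Continuation k = { add_one_then_store, &result };
    square_cps(x, &k);
    return result;
}
```

Here `square_plus_one(5)` yields 26 without `square_cps` ever returning a value; a compiler doing this transformation would generate the continuation and its captured state automatically.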

devshgraphicsprogramming commented 11 months ago

Look, on a hardware level, all of the following:

if/else if/else
for
while
do while
switch
uninlined function (yes, the spec says all functions must be inlineable, but it's the driver's choice whether to inline or not)
function pointer call

are implemented as jumps/gotos in the ISA. The only difference in the latter two is that the jump destination (return) address at the end of the block of code comes dynamically from a register and isn't a constant. The only difference between an un-inlined function and a fptr call is that the jump address to enter the routine is also not known in advance.

Btw, an optimized (jump table) switch has the reverse behaviour: the address to jump to is not known, but the return address is constant.

Either way, a TAIL function pointer call has literally no overhead or difference compared to an if/loop/switch because it's effectively the same thing. Hint: if you're worried about divergent function calls executing at 1/64 the speed (or, when you stop caring about GCN, 1/32), then a similarly divergent if or switch will also execute at 1/64 the speed.

Finally, even if you call different function pointers in each SIMD lane, your return jump address is guaranteed to be dynamically uniform across the active lanes! (your active callees will reconverge)

This is why it makes more sense to ask for Function Pointer calls in SPIR-V than unstructured control flow, because if you start allowing random gotos you have no guarantees on the invocation coming back to the call site and reconverging.
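
The reconvergence argument can be made concrete with a CPU-side simulation of a "waterfall" loop in C. This is only a sketch under assumptions: an 8-lane subgroup, a made-up `waterfall_call` helper, and lane execution modelled as a sequential loop. Each pass picks the target of the first pending lane, runs it for all lanes that share that target, and every lane ends up back at the same point after the loop.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* assumed subgroup size for the simulation */

typedef int (*LaneFn)(int);

int incr(int x) { return x + 1; }
int dbl(int x)  { return x * 2; }

/* Calls fn[lane](in[lane]) for every lane, one uniform call per distinct
   target. Returns the number of passes, i.e. distinct targets seen. */
unsigned waterfall_call(LaneFn const fn[LANES], const int in[LANES], int out[LANES]) {
    uint32_t pending = (1u << LANES) - 1u; /* all lanes active */
    unsigned passes = 0;
    while (pending != 0) {
        /* Broadcast the target of the first pending lane. */
        unsigned first = 0;
        while (((pending >> first) & 1u) == 0) {
            ++first;
        }
        LaneFn target = fn[first];
        assert(target != 0);
        /* Run that target for every pending lane that wants it. */
        for (unsigned lane = 0; lane < LANES; ++lane) {
            if (((pending >> lane) & 1u) != 0 && fn[lane] == target) {
                out[lane] = target(in[lane]);
                pending &= ~(1u << lane);
            }
        }
        ++passes;
    }
    /* All lanes reconverge here: the return point is the same for everyone. */
    return passes;
}
```

With two distinct targets spread over the lanes the loop runs twice, which is the same cost shape as a divergent two-way if.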

devshgraphicsprogramming commented 11 months ago

added an edit about live variable analysis.

leios commented 2 months ago

Has this discussion moved anywhere (even outside of this issue)? I am also quite interested in having some form of function pointers passed through Vulkan. There are some limitations to pointers when used in CUDA (And I assume other compute APIs). It's also not possible in OpenCL (AFAIK).

If Vulkan can somehow deal with pointers in a cleaner way, then there's a good reason for certain workloads to use it instead of traditional compute APIs.

devshgraphicsprogramming commented 2 months ago

> Has this discussion moved anywhere (even outside of this issue)? I am also quite interested in having some form of function pointers passed through Vulkan. There are some limitations to pointers when used in CUDA (And I assume other compute APIs). It's also not possible in OpenCL (AFAIK).

> If Vulkan can somehow deal with pointers in a cleaner way, then there's a good reason for certain workloads to use it instead of traditional compute APIs.

I believe we may see some extension from the Mesa side, as someone's hobby project.

KhronosGroup / Vulkan-Docs