Introduce OptiX direct callable API that owns groupdata buffer

chellmuth commented 1 year ago

Description

On the GPU, the renderer currently owns a shader's groupdata params buffer so that it can pass the same pointer to both the init and entry functions.

The downside of this approach is that, to avoid dynamic memory allocation, the renderer must commit to a buffer size at build time. A conservative size chosen to accommodate potentially large shaders can add gigabytes of unnecessary memory footprint.
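To make the footprint concern concrete, here is a rough back-of-the-envelope sketch. All numbers are hypothetical (launch size, per-thread buffer sizes), and it assumes the renderer reserves one fixed-size buffer per in-flight thread; none of this is taken from the patch itself.

```cpp
#include <cstddef>

// Bytes reserved when every in-flight thread gets a fixed-size groupdata buffer.
// Assumes a 64-bit std::size_t.
constexpr std::size_t pool_bytes(std::size_t threads, std::size_t bytes_per_thread)
{
    return threads * bytes_per_thread;
}

// Hypothetical launch: one thread per pixel at 1920x1080.
constexpr std::size_t kThreads = 1920 * 1080;

// Hypothetical sizes: a conservative build-time bound vs. what a
// typical shader group might actually need.
constexpr std::size_t kConservativeBytes = 2048;
constexpr std::size_t kActualBytes       = 128;

// Roughly 3.96 GiB reserved with the conservative bound...
static_assert(pool_bytes(kThreads, kConservativeBytes) == 4246732800ull);
// ...versus roughly 253 MiB if each buffer were sized to fit.
static_assert(pool_bytes(kThreads, kActualBytes) == 265420800ull);
```

Under these assumed numbers, the conservative bound reserves sixteen times the memory the shaders would use.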

This patch adds a new OptiX direct callable as an alternative to the existing init and entry callables. The new function allocates an exactly sized groupdata params buffer, so the GPU pays only for the memory it needs, and then calls init and entry.

Particularly large shaders can require buffers larger than the space available on the CUDA stack. To handle this case, there is a new option, "max_optix_groupdata_alloc". Any shader requiring a groupdata buffer larger than this value will not allocate its own buffer and will instead use the pointer passed in by the renderer (presumably coming from a global memory pool).
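The control flow described above can be sketched on the host in plain C++ (C++17). This is only an illustration, not the generated code: `fused_callable`, `GroupFn`, and `kMaxGroupdataAlloc` are hypothetical names, the threshold value is made up, and a template parameter stands in for the buffer size that is known when the shader group is compiled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the generated init and entry functions.
// Both receive the same groupdata pointer.
using GroupFn = void (*)(void* shaderglobals, void* groupdata);

// Hypothetical limit playing the role of "max_optix_groupdata_alloc".
constexpr std::size_t kMaxGroupdataAlloc = 1024;

// Sketch of a fused callable. Returns true if it used its own
// stack buffer, false if it fell back to the renderer's pointer.
template <std::size_t GroupdataSize>
bool fused_callable(void* sg, void* renderer_buffer, GroupFn init, GroupFn entry)
{
    if constexpr (GroupdataSize <= kMaxGroupdataAlloc) {
        // Small enough: allocate an exactly sized buffer on the stack.
        alignas(16) unsigned char local[GroupdataSize > 0 ? GroupdataSize : 1];
        init(sg, local);
        entry(sg, local);
        return true;
    } else {
        // Too large for the stack: use the renderer-provided buffer.
        assert(renderer_buffer != nullptr);
        init(sg, renderer_buffer);
        entry(sg, renderer_buffer);
        return false;
    }
}
```

A caller would instantiate it per shader group, for example `fused_callable<64>(sg, pool_ptr, init_fn, entry_fn)`. In this sketch the renderer's pointer is ignored for small shaders, so the renderer only has to back the oversized ones with pool memory.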

Tests

All testshade tests run with the existing API, and again with the new fused API.

Checklist:

[x] I have read the contribution guidelines.
[x] I have previously submitted a Contributor License Agreement.
[x] I have updated the documentation, if applicable.
[x] I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
[x] My code follows the prevailing code style of this project.

tgrant-nv commented 1 year ago

This looks very cool, Chris. I'm still kicking the tires, but it seems like a great step forward.

tgrant-nv commented 1 year ago

I see that the fused group function and the separate init/group functions all make it into the generated PTX. If you know in advance that you will or won't be using the fused function, it might be a good idea to drop the unused entry points. It would certainly make the PTX a lot smaller, and could save time in codegen and optimization.

chellmuth commented 1 year ago

> I see that the fused group function and the separate init/group functions all make it into the generated PTX. If you know in advance that you will or won't be using the fused function, it might be a good idea to drop the unused entry points. It would certainly make the PTX a lot smaller, and could save time in codegen and optimization.

I'm happy to add an option that lets the user specify which callable(s) they're using, so we can remove the other. For what it's worth, in my tests the wrapper callables are just a couple dozen lines and keeping them all doesn't appear to affect codegen times.

chellmuth commented 1 year ago

The latest push rebases on the recent reparameter work and switches optix-testrender from the current API to the single-callable API.

lgritz commented 1 year ago

Code looks fine, but can you please document the new attribute name you added, in oslexec.h where all the others are documented? Thanks.

chellmuth commented 1 year ago

Ah yeah, you caught me pushing commits only partially addressing your feedback :) Latest commit adds the attribute documentation.

AcademySoftwareFoundation / OpenShadingLanguage