Closed tgrant-nv closed 12 months ago
This LGTM, thanks for addressing all the offline feedback we discussed!
@chellmuth Do we still like this? Is it ready to merge?
I think there are some conflicts. @timgrant, can you rebase on top of the current main?
I will try to rebase, but I was hesitant to do so with main in its current state since I won't be able to confirm the correctness of the merge.
@chellmuth Do we still like this? Is it ready to merge?
Yeah, we've tested/benchmarked this code and saw some nice speedups.
Okay, I have rebased on main. It was surprisingly smooth.
These are the failures I'm seeing among the optix
tests:
The following tests FAILED:
179 - function-overloads.optix (Failed)
471 - testoptix-reparam.optix.opt (Failed)
515 - spline-boundarybug.optix (Failed)
734 - testoptix.optix.opt (Failed)
735 - testoptix-noise.optix.opt (Failed)
I believe this is the same set of failing tests from main.
I think 4 of those 5 are addressed by my PR yesterday, which I just merged into main. Chasing after that last one next.
I see the same set of failures in a current build of main with #1715. I'm using LLVM 14.0.3 and CUDA 11.8, FWIW.
Sorry, my bad, I hadn't actually done the merge yet. Coming shortly.
Thanks, Larry. I just tested with the 535.98 driver, and I'm still seeing three failures after #1718:
471 - testoptix-reparam.optix.opt (Failed)
734 - testoptix.optix.opt (Failed)
735 - testoptix-noise.optix.opt (Failed)
It's expected for testoptix.optix.opt to fail, since there are some printf
sub-tests that require the string contents (which we should get rid of). But the image diff sub-tests are also failing, which is not expected.
When I run testoptix-reparam.optix.opt, there is a blue background but the three spheres are all black. For the other two tests the images are all black.
Are these tests passing for you?
I also noticed that although spline-boundarybug.optix is not reported as failing, there is a slight precision difference in the output:
-(0.600000,0.306263)
+(0.600000,0.306262)
I found the source of the black shader output. There is a separate definition of ShaderGlobals
in testshade/cuda/rend_lib.h
, which does not include two new members added in #1702. Adding those fields clears up the newly failing tests. Larry, would you like to commit that as a separate fix? Or I can roll it into this PR.
diff --git a/src/testrender/cuda/rend_lib.h b/src/testrender/cuda/rend_lib.h
index e66f4e33..4ad59208 100644
--- a/src/testrender/cuda/rend_lib.h
+++ b/src/testrender/cuda/rend_lib.h
@@ -99,6 +99,8 @@ struct ShaderGlobals {
float3 Ps, dPsdx, dPsdy;
void* renderstate;
void* shadingStateUniform;
+ int thread_index;
+ int shade_index;
void* tracedata;
void* objdata;
void* context;
Would you mind sending that off as a separate PR and I'll just merge it right away?
Description
This PR is an overhaul of PTX compilation in OSL. The goal is to minimize the size of the generated PTX to help reduce time spent in subsequent stages (e.g., module creation, pipeline linking, etc.).
rend_lib
The main functional change is in how the functions defined in rend_lib.cu are handled. In the current setup, rend_lib.cu is compiled to a standalone PTX file using nvcc. The functions are considered
extern
in generated shaders, and the dependencies are satisfied when the final pipeline is linked together. Since the functions areextern
, they are not available for optimization when each shader is generated. This prevents the inlining of small functions, which incurs a lot of function call overhead in terms of PTX size and run-time cost.So instead of compiling rend_lib.cu separately, it is compiled to LLVM bitcode along with the rest of the "shadeops" sources. This bitcode is used to seed the LLVM Module for each shader. This makes the definitions available when each shader is being generated and optimized, which allows the optimizer to make better decisions. It also allows for better control of inlining decisions which affect the size and quality of the generated PTX.
shadeops
Another significant functional change is in how the shadeops functions (including those defined in rend_lib.cu) are treated. In the current pipeline, each shader must carry along with it definitions for all non-inlined shadeops functions that are used. This bloats the size of the PTX for each shader, and results in many duplicate copies of those functions existing in the final linked pipeline.
In the new pipeline, the unified shadeops+rend_lib bitcode is compiled to PTX using the offline
llc
tool. This PTX is used by the renderer to create a single "shadeops" module that provides the definitions for those functions. This makes it possible to drop the definitions for those functions from the PTX for each shader, which can significantly reduce the size of the generated PTX.quirks
This PR contains some interesting "quirks" which are largely fallout from compiling rend_lib.cu using clang instead of nvcc:
void*
instead of the actual OSL types (e.g.,OSL::Color3*
). Trying to use the actual types produces an LLVM assertion failure, for reasons that I don't understand.memcpy
andmemset
into long series of byte operations. Functions likeosl_get_matrix
are bloated substantially if the hints aren't provided.Tests
No new tests added, but no new test failures were added either. :wink:
Checklist: