AnswerDotAI / gpu.cpp

A lightweight library for portable low-level GPU computation using WebGPU.
https://gpucpp.answer.ai
Apache License 2.0

CUDA vs WGSL #20

Closed · junjihashimoto closed this 3 months ago

junjihashimoto commented 3 months ago

CUDA's matmul is 1.5 times faster than WGSL's. I would like to know what the performance overhead is.

WGSL

~/git/gpu.cpp/examples/matmul (main)
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU (UUID: GPU-d69302e7-ae22-5fbc-39a3-11f66a27d0ac)
$ git rev-parse HEAD
5e030eb142fd25d92c2a64d338bd84265ba1106e
$ git diff
diff --git a/examples/matmul/run.cpp b/examples/matmul/run.cpp
index 4e61968..ae9a127 100644
--- a/examples/matmul/run.cpp
+++ b/examples/matmul/run.cpp
@@ -552,9 +552,9 @@ Kernel selectMatmul(Context &ctx, int version,
     kernel = createKernel(ctx, matmul, bindings,
                           /*nWorkgroups*/ nWorkgroups);
   } else if (version == 4 || version == 6) {
-    static constexpr size_t BM = 64;
+    static constexpr size_t BM = 128;
     static constexpr size_t BK = 16;
-    static constexpr size_t BN = 64;
+    static constexpr size_t BN = 128;
     static constexpr size_t TM = BM / BK;
     static constexpr size_t TN = BN / BK;
     Shape wgSize = {(BM / TM) * (BN / TN), 1, 1}; // This is the same as BK * BK.
$ make | grep GFLOPS
138.5 milliseconds / dispatch ~ 1984.94 GFLOPS
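
For reference, here is the arithmetic behind these GFLOPS figures. This is a minimal sketch; the problem size M = K = 4096, N = 8192 is an assumption inferred from the numbers (2*M*K*N ~ 2.75e11 FLOPs per dispatch), not read from run.cpp.

// Sketch of the GFLOPS arithmetic behind the log line above.
// The matrix sizes are an assumption inferred from the reported figures,
// not taken from run.cpp -- adjust M/K/N to match the example you run.
#include <cstdio>

int main() {
  const double M = 4096, K = 4096, N = 8192;        // assumed problem size
  const double flopsPerDispatch = 2.0 * M * K * N;  // one multiply + one add per MAC
  const double msPerDispatch = 138.5;               // measured above
  const double gflops = flopsPerDispatch / (msPerDispatch * 1e-3) / 1e9;
  printf("%.2f GFLOPS\n", gflops);                  // ~1985, matching the log
  return 0;
}

The same FLOP count also reproduces the CUDA figure below (2.75e11 FLOPs / 71.16 ms ~ 3863 GFLOPS), so the two measurements are directly comparable.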

CUDA (An equivalent implementation of WGSL's matmul)

The CUDA matmul code is in this gist: https://gist.github.com/junjihashimoto/3a3020797076f8b5a0b4afcf0b448b93

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU (UUID: GPU-d69302e7-ae22-5fbc-39a3-11f66a27d0ac)
$ nvcc matmul.cu
$ ./a.out
Execution time: 355.806 ms
Execution time / iterations: 71.1612 ms
GFLOPS: 3862.75

Looking at the results below, it seems that this implementation still has room to improve performance by about 2x.

CUDA (Simon Boehm's kernels)

FYI, the performance of Simon Boehm's kernels is as follows:

[Image: benchmark results for Simon Boehm's CUDA matmul kernels]

The GPU is "NVIDIA GeForce RTX 3080 Laptop GPU".

austinvhuang commented 3 months ago

Thanks for trying this out.

One difference that's also been on my TODO list is dispatching the matmul iterations asynchronously. The previous code blocks on each dispatch. The CUDA implementation gets this by default, since kernel launches are asynchronous and it only synchronizes once at the end with cudaEventSynchronize.
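
For anyone comparing against the gist, a minimal sketch of that CUDA-side timing pattern. The kernel below is a naive placeholder (the gist uses a tiled kernel) and the sizes and iteration count are assumptions; the point is the structure: all iterations are queued without blocking and only the stop event is synchronized.

// Sketch of the "queue everything, sync once" CUDA timing pattern.
// matmulKernel is a naive placeholder, not the gist's tiled kernel;
// M, K, N and iters are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matmulKernel(const float *A, const float *B, float *C,
                             int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
  }
}

int main() {
  const int M = 4096, K = 4096, N = 8192, iters = 5;
  float *A, *B, *C;
  cudaMalloc(&A, sizeof(float) * M * K);
  cudaMalloc(&B, sizeof(float) * K * N);
  cudaMalloc(&C, sizeof(float) * M * N);
  cudaMemset(A, 0, sizeof(float) * M * K);
  cudaMemset(B, 0, sizeof(float) * K * N);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    matmulKernel<<<grid, block>>>(A, B, C, M, K, N);  // queued, does not block
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // single sync after all iterations

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("Execution time / iterations: %g ms\n", ms / iters);
  return 0;
}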

I pushed some changes which do this + adjust BK (16 -> 8), which on my M1 bumps me ~25%, from ~2 TFLOPS to 2.5 TFLOPS. I haven't had time to try this out on my A6000 but will play around later. One big TODO is a small autotuning module to automatically optimize the tiling parameters, both to avoid the manual tweaking and to let automation find the right parameters for a given environment.
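
For illustration, a tiny sketch of what such an autotuning loop could look like, kept independent of gpu.cpp's API: benchmarkMs is a stand-in for "create the kernel with these tile sizes, dispatch it, and measure", and the candidate list is up to the caller.

// Sketch of a brute-force tile-size autotuner. Everything here is
// illustrative; benchmarkMs stands in for building + timing a kernel.
#include <cstdio>
#include <functional>
#include <limits>
#include <vector>

struct Tiling { size_t BM, BK, BN; };

Tiling autotune(const std::vector<Tiling> &candidates,
                const std::function<double(const Tiling &)> &benchmarkMs) {
  Tiling best{};
  double bestMs = std::numeric_limits<double>::infinity();
  for (const auto &t : candidates) {
    double ms = benchmarkMs(t);  // compile, dispatch, time
    if (ms < bestMs) { bestMs = ms; best = t; }
  }
  printf("best tiling: BM=%zu BK=%zu BN=%zu (%.2f ms)\n",
         best.BM, best.BK, best.BN, bestMs);
  return best;
}

Fed candidates like {64, 16, 64}, {128, 16, 128} and {64, 8, 64}, a loop like this would find parameter wins such as the BK 16 -> 8 change without manual tweaking.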

That said, I think there'll always be some portability cost relative to raw CUDA due to the limits of Vulkan (which is probably the runtime target for Linux + NVIDIA) and of WebGPU itself. The goal here is to see how close we can get by eliminating weaknesses in the implementation; if we hit a hard ceiling due to specific missing WebGPU / Vulkan features, that is valuable feedback to share with the committees developing those standards.

austinvhuang commented 3 months ago

Closing for now but will post updates here as they arise.

junjihashimoto commented 3 months ago

I compared SM warp occupancy for CUDA and WGSL. Both have the same occupancy rate: 33%. But the processing time for CUDA (about 30 ms) appears to be half that of WGSL (about 70 ms). This is probably because the quality of Tint's SPIR-V output is inferior to that of NVCC's PTX; Tint does not seem to optimize its code.

CUDA's SM Warp Occupancy

[Image: profiler screenshot of SM warp occupancy for the CUDA kernel]

WGSL (Dawn)'s SM Warp Occupancy

[Image: profiler screenshot of SM warp occupancy for the WGSL (Dawn) kernel]

MichealReed commented 3 months ago

> Tint does not seem to optimize its code.

https://github.com/KhronosGroup/SPIRV-Tools?tab=readme-ov-file

Maybe the optimizer from the SPIRV-Tools repository is not run in the pipeline by default?

junjihashimoto commented 3 months ago

It seems that SPIRV-Tools is used for the SPIR-V reader, not the writer. https://github.com/search?q=repo%3Agoogle%2Fdawn+SPIR-V+opt&type=code&p=1 https://github.com/google/dawn/blob/2e32da10881cf975ee1364c857f8384c0fe734d5/docs/tint/spirv-reader-overview.md?plain=1#L25-L26

So the optimizer is not run.

MichealReed commented 3 months ago

> It seems that SPIRV-Tools is used for the SPIR-V reader, not the writer. https://github.com/search?q=repo%3Agoogle%2Fdawn+SPIR-V+opt&type=code&p=1 https://github.com/google/dawn/blob/2e32da10881cf975ee1364c857f8384c0fe734d5/docs/tint/spirv-reader-overview.md?plain=1#L25-L26
>
> So the optimizer is not run.

Here is more info on it

https://www.lunarg.com/wp-content/uploads/2017/08/SPIR-V-Shader-Size-Reduction-Using-spirv-opt_v1.0.pdf

and its use in the shaderc repo:

https://github.com/search?q=repo%3Agoogle%2Fshaderc%20%20spvtools%3A%3AOptimizer&type=code
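
For concreteness, a minimal sketch of running the optimizer over a SPIR-V binary with SPIRV-Tools' C++ API, roughly the way shaderc wires in spvtools::Optimizer; whether and where something like this could be hooked in after Tint's SPIR-V writer is the open question here.

// Sketch of invoking SPIRV-Tools' optimizer on an existing SPIR-V binary.
// The target environment and pass preset are illustrative choices.
#include <cstdint>
#include <vector>

#include "spirv-tools/optimizer.hpp"

bool optimizeSpirv(const std::vector<uint32_t> &in,
                   std::vector<uint32_t> *out) {
  spvtools::Optimizer opt(SPV_ENV_VULKAN_1_1);
  opt.RegisterPerformancePasses();  // performance-oriented pass preset
  return opt.Run(in.data(), in.size(), out);
}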

edit: it seems we can control the use of specific readers and writers with these flags when building the lib:

-DTINT_BUILD_SPV_READER=ON -DTINT_BUILD_SPV_WRITER=ON -DTINT_BUILD_WGSL_READER=ON -DTINT_BUILD_WGSL_WRITER=ON -DTINT_BUILD_MSL_WRITER=ON -DTINT_BUILD_HLSL_WRITER=ON

I also read in the doc from 2017: "Besides kernels and physical addressing, there are a few other features that are not currently supported and will cause these passes to return silently without making changes."

Maybe compute kernels are still not optimized?

https://github.com/KhronosGroup/SPIRV-Tools/issues/5597 may be related.