Thanks for trying this out.
One thing that's also been on my TODO list is dispatching the matmul iterations asynchronously. The previous code blocks on each dispatch, whereas the CUDA implementation launches kernels asynchronously by default and synchronizes explicitly with cudaEventSynchronize.
I pushed some changes which do this and also adjust BK (16 -> 8), which on my M1 bumps me ~25%, from ~2 TFLOPS to ~2.5 TFLOPS. I haven't had time to try this out on my A6000 but will play around later. One big TODO is a small autotuning module to automatically optimize tiling parameters, to avoid the manual tweaking (and allow automation to find the right parameters for a given environment).
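For reference, a minimal sketch of what batched dispatch looks like against the webgpu.h C API. This is illustrative only, not gpu.cpp's actual code: the function name `dispatchBatched` is made up, error handling and resource releases are omitted, and the exact `wgpuQueueOnSubmittedWorkDone` callback signature varies across Dawn revisions.

```cpp
// Sketch: record N dispatches into one command buffer and synchronize
// once at the end, instead of blocking after every dispatch.
#include <webgpu/webgpu.h>
#include <atomic>

void dispatchBatched(WGPUDevice device, WGPUQueue queue,
                     WGPUComputePipeline pipeline, WGPUBindGroup bindGroup,
                     uint32_t groupsX, uint32_t groupsY, int nIter) {
  WGPUCommandEncoder enc = wgpuDeviceCreateCommandEncoder(device, nullptr);
  WGPUComputePassEncoder pass = wgpuCommandEncoderBeginComputePass(enc, nullptr);
  wgpuComputePassEncoderSetPipeline(pass, pipeline);
  wgpuComputePassEncoderSetBindGroup(pass, 0, bindGroup, 0, nullptr);
  for (int i = 0; i < nIter; ++i) {
    // Each iteration is only recorded here; nothing blocks until submit.
    wgpuComputePassEncoderDispatchWorkgroups(pass, groupsX, groupsY, 1);
  }
  wgpuComputePassEncoderEnd(pass);
  WGPUCommandBuffer cmd = wgpuCommandEncoderFinish(enc, nullptr);
  wgpuQueueSubmit(queue, 1, &cmd);  // one submit for all iterations

  // Single synchronization point, analogous to one cudaEventSynchronize
  // after a batch of asynchronous kernel launches.
  std::atomic<bool> done{false};
  wgpuQueueOnSubmittedWorkDone(
      queue,
      [](WGPUQueueWorkDoneStatus, void* userdata) {
        static_cast<std::atomic<bool>*>(userdata)->store(true);
      },
      &done);
  while (!done.load()) { /* pump events, e.g. wgpuDeviceTick on Dawn */ }
}
```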
That said, I think there'll always be some portability cost relative to raw CUDA due to the limits of Vulkan (which is probably the runtime target for Linux + NVIDIA) and of WebGPU itself. The goal here is to see how close we can get by eliminating weaknesses in the implementation; if we hit a hard ceiling due to specific missing WebGPU / Vulkan features, that can be valuable feedback to share with the committees developing those standards.
Closing for now but will post updates here as they arise.
I compared SM warp occupancy on CUDA and WGSL. Both have the same occupancy rate: 33%. But the CUDA processing time (about 30 ms) appears to be less than half that of WGSL (about 70 ms). This is probably because the quality of Tint's SPIR-V is inferior to the quality of NVCC's PTX. Tint does not seem to optimize its code.
https://github.com/KhronosGroup/SPIRV-Tools?tab=readme-ov-file
Maybe the optimizer from the SPIRV-Tools repository is not run in the pipeline by default?
It seems that SPIRV-Tools is used for the SPIR-V reader, not the writer: https://github.com/search?q=repo%3Agoogle%2Fdawn+SPIR-V+opt&type=code&p=1 https://github.com/google/dawn/blob/2e32da10881cf975ee1364c857f8384c0fe734d5/docs/tint/spirv-reader-overview.md?plain=1#L25-L26
So the optimizer is not run.
Here is more info on it, and its use in the ShaderC repo:
https://github.com/search?q=repo%3Agoogle%2Fshaderc%20%20spvtools%3A%3AOptimizer&type=code
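If one wanted to run the optimizer manually over Tint's SPIR-V output before pipeline creation, a minimal sketch using SPIRV-Tools' C++ API might look like this. The function name `OptimizeSpirv` and the choice of `RegisterPerformancePasses` (the pass list behind `spirv-opt -O`) are my assumptions, not anything Dawn does today.

```cpp
// Sketch: run SPIRV-Tools' optimizer on a SPIR-V module, e.g. one
// produced by Tint's SPIR-V writer.
#include <cstdint>
#include <vector>
#include "spirv-tools/optimizer.hpp"

// Returns the optimized module, or the input unchanged if optimization fails.
std::vector<uint32_t> OptimizeSpirv(const std::vector<uint32_t>& spirv) {
  spvtools::Optimizer opt(SPV_ENV_VULKAN_1_1);
  opt.RegisterPerformancePasses();  // same passes as `spirv-opt -O`
  std::vector<uint32_t> optimized;
  if (!opt.Run(spirv.data(), spirv.size(), &optimized)) {
    return spirv;  // fall back to the unoptimized module
  }
  return optimized;
}
```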
edit: seems we can control use of specific readers and writers with these flags when building the lib
-DTINT_BUILD_SPV_READER=ON -DTINT_BUILD_SPV_WRITER=ON -DTINT_BUILD_WGSL_READER=ON -DTINT_BUILD_WGSL_WRITER=ON -DTINT_BUILD_MSL_WRITER=ON -DTINT_BUILD_HLSL_WRITER=ON
I also read, in a doc from 2017: "Besides kernels and physical addressing, there are a few other features that are not currently supported and will cause these passes to return silently without making changes."
Maybe compute kernels are still not optimized?
https://github.com/KhronosGroup/SPIRV-Tools/issues/5597 may be related.
CUDA's matmul is 1.5 times faster than WGSL's. I would like to know where the performance overhead comes from.
WGSL
CUDA (an equivalent implementation of the WGSL matmul)
The CUDA matmul code is in the gist(https://gist.github.com/junjihashimoto/3a3020797076f8b5a0b4afcf0b448b93).
Looking at the results below, it seems this implementation has about 2x of headroom left.
CUDA (Simon Boehm's kernels)
FYI, the performance of Simon Boehm's kernels is as follows:
The GPU is "NVIDIA GeForce RTX 3080 Laptop GPU".