coreylowman opened 1 year ago
It may be possible to make this into one kernel again by launching twice the number of threads and adding a few conditionals to switch between lhs/rhs behavior. I suspect that CUDA may not be running them in parallel because they both require access to the lhs, rhs, and grad_out buffers, and CUDA may not be able to figure out that the kernels don't write to those buffers.
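Roughly what I have in mind, as a sketch only (buffer names and the per-element gradient math are placeholders, not the actual dfdx kernels): launch the grid over 2 * numel threads and branch on which half of the index space a thread lands in.

```cuda
// Sketch: one launch covers both gradient computations by doubling the grid
// and branching on which half of the index space a thread falls in.
// The per-element math is a placeholder (elementwise-mul backward).
__global__ void backward_merged(
    const float *lhs, const float *rhs, const float *grad_out,
    float *grad_lhs, float *grad_rhs, size_t numel
) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numel) {
        // first half of the grid: lhs gradient
        grad_lhs[i] += rhs[i] * grad_out[i];
    } else if (i < 2 * numel) {
        // second half of the grid: rhs gradient
        size_t j = i - numel;
        grad_rhs[j] += lhs[j] * grad_out[j];
    }
}
```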
Huh, I wonder if there's some typing/annotation we can write to make that more clear? I figured const would be enough 🤷
Maybe we need to do something with __restrict__: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/
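For reference, the annotation from that post would look something like this on one of the kernels (parameter names are placeholders, not the actual dfdx signature):

```cuda
// Sketch: const + __restrict__ on the read-only inputs tells nvcc the
// pointers don't alias, so it is free to cache/reorder the loads.
__global__ void backward_lhs(
    const float * __restrict__ lhs,
    const float * __restrict__ rhs,
    const float * __restrict__ grad_out,
    float * __restrict__ grad_lhs,
    size_t numel
) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numel) {
        grad_lhs[i] += rhs[i] * grad_out[i];  // placeholder gradient math
    }
}
```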
Currently we run the lhs & rhs kernels on separate streams, but due to kernel occupancy they can't actually run in parallel.
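For context, the launch pattern is roughly this on the host side (simplified sketch, assuming per-buffer kernels named backward_lhs / backward_rhs as above):

```cuda
#include <cuda_runtime.h>

// Sketch: each backward kernel goes on its own stream, so the scheduler is
// allowed to overlap them, but only if the first kernel leaves enough free
// SMs for the second, which occupancy currently prevents.
void launch_backward(const float *lhs, const float *rhs, const float *grad_out,
                     float *grad_lhs, float *grad_rhs, size_t numel) {
    cudaStream_t s_lhs, s_rhs;
    cudaStreamCreate(&s_lhs);
    cudaStreamCreate(&s_rhs);

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((numel + threads - 1) / threads);

    backward_lhs<<<blocks, threads, 0, s_lhs>>>(lhs, rhs, grad_out, grad_lhs, numel);
    backward_rhs<<<blocks, threads, 0, s_rhs>>>(lhs, rhs, grad_out, grad_rhs, numel);

    cudaStreamSynchronize(s_lhs);
    cudaStreamSynchronize(s_rhs);
    cudaStreamDestroy(s_lhs);
    cudaStreamDestroy(s_rhs);
}
```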
Investigate ways to make these run in parallel, and whether it actually helps. For example, if we have to make each kernel take twice as long to execute in order to run them in parallel, we might not gain anything.