coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Make binary backward kernels run in parallel #701

Open coreylowman opened 1 year ago

coreylowman commented 1 year ago

Currently we run the lhs & rhs kernels on separate streams, but due to kernel occupancy, they can't actually run in parallel:

[attached image]

Investigate ways to make these run in parallel, and whether doing so actually helps. For example, if each kernel has to take twice as long to execute in order for them to run in parallel, we might not gain anything.
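
For reference, a minimal sketch (with hypothetical kernel and parameter names, and a multiply backward as the example op) of what the current two-stream launch looks like, and why occupancy can block overlap: if the first launch already fills the SMs, the second stream has no resources left to run on concurrently.

```cuda
#include <cuda_runtime.h>

// Hypothetical lhs backward kernel (e.g. d(lhs*rhs)/d(lhs) = rhs * grad_out).
__global__ void binary_bwd_lhs(const float *lhs, const float *rhs,
                               const float *grad_out, float *grad_lhs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad_lhs[i] += rhs[i] * grad_out[i];
}

// Hypothetical rhs backward kernel (e.g. d(lhs*rhs)/d(rhs) = lhs * grad_out).
__global__ void binary_bwd_rhs(const float *lhs, const float *rhs,
                               const float *grad_out, float *grad_rhs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad_rhs[i] += lhs[i] * grad_out[i];
}

void launch_backward(const float *lhs, const float *rhs, const float *grad_out,
                     float *grad_lhs, float *grad_rhs, int n,
                     cudaStream_t s_lhs, cudaStream_t s_rhs) {
    int block = 256, grid = (n + block - 1) / block;
    // Separate streams *allow* overlap but don't guarantee it: with a full
    // grid per launch, the first kernel already occupies the GPU, so the
    // second effectively runs after it.
    binary_bwd_lhs<<<grid, block, 0, s_lhs>>>(lhs, rhs, grad_out, grad_lhs, n);
    binary_bwd_rhs<<<grid, block, 0, s_rhs>>>(lhs, rhs, grad_out, grad_rhs, n);
}
```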

nkoppel commented 1 year ago

It may be possible to turn this back into one kernel by launching twice the number of threads and using a few conditionals to switch between lhs/rhs behavior. I suspect that CUDA may not be running them in parallel because they both require access to the lhs, rhs, and grad_out buffers, and CUDA may not be able to figure out that the kernels don't write to those buffers.
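
A minimal sketch of that fused approach, again with hypothetical names and a multiply backward as the example: one launch covers 2*n threads, and an index check selects lhs vs rhs behavior so both halves of the work occupy the GPU in a single kernel.

```cuda
// Hypothetical fused backward kernel: first half of the threads compute
// grad_lhs, second half compute grad_rhs.
__global__ void binary_bwd_fused(const float *lhs, const float *rhs,
                                 const float *grad_out, float *grad_lhs,
                                 float *grad_rhs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        grad_lhs[i] += rhs[i] * grad_out[i];   // lhs half
    } else if (i < 2 * n) {
        int j = i - n;
        grad_rhs[j] += lhs[j] * grad_out[j];   // rhs half
    }
}

// Launch with a grid sized for 2*n threads instead of n:
//   int block = 256, grid = (2 * n + block - 1) / block;
//   binary_bwd_fused<<<grid, block>>>(lhs, rhs, grad_out, grad_lhs, grad_rhs, n);
```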

coreylowman commented 1 year ago

Huh, I wonder if there's some typing/annotation we could add to make that clearer? I figured `const` would be enough 🤷

Maybe we need to do something with `__restrict__`: https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/
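
A minimal sketch of what that annotation could look like on one of the backward kernels (hypothetical names; whether it changes anything about kernel overlap here would still need to be measured):

```cuda
// Marking the read-only inputs as `const ... __restrict__` promises the
// compiler that these pointers don't alias the output buffer, which can
// enable better load scheduling/caching within the kernel. Note that
// __restrict__ informs the compiler about aliasing inside one kernel;
// whether it affects how the GPU schedules two kernels on separate streams
// is exactly what would need profiling.
__global__ void binary_bwd_lhs(const float *__restrict__ lhs,
                               const float *__restrict__ rhs,
                               const float *__restrict__ grad_out,
                               float *__restrict__ grad_lhs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad_lhs[i] += rhs[i] * grad_out[i];
}
```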