Open coreylowman opened 1 year ago
An alternative (lower effort and better supported by AMD) would be to provide HIP support. The interface is a nearly perfect 1:1 match with CUDA, including hipRTC.
I've got some experience with OpenCL, especially in the context of generating kernels from compile-time representation of expressions, in C++. This sounds fun and I'd love to help! :)
> An alternative (lower effort and better supported by AMD) would be to provide HIP support. The interface is a nearly perfect 1:1 match with CUDA, including hipRTC.
Is that this? https://github.com/ROCm-Developer-Tools/HIP
Do you know if there's a C API that we can create an FFI for?
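Yes, the HIP runtime API is plain C (hip_runtime_api.h), so hand-written bindings are feasible. Below is a minimal sketch of what that FFI layer could look like; the entry-point names are real HIP functions, but the linked library name and the enum values are assumptions to verify against the installed ROCm headers.

```rust
use std::os::raw::{c_int, c_void};

// hipError_t and hipMemcpyKind are C enums in hip_runtime_api.h;
// plain integers keep this sketch simple. 0 == hipSuccess.
pub type HipError = c_int;
pub const HIP_MEMCPY_HOST_TO_DEVICE: c_int = 1; // assumed value of hipMemcpyHostToDevice
pub const HIP_MEMCPY_DEVICE_TO_HOST: c_int = 2; // assumed value of hipMemcpyDeviceToHost

#[link(name = "amdhip64")] // assumed library name (libamdhip64.so on ROCm)
extern "C" {
    // These mirror cudaMalloc / cudaMemcpy / cudaFree / cudaDeviceSynchronize.
    pub fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> HipError;
    pub fn hipMemcpy(
        dst: *mut c_void,
        src: *const c_void,
        size_bytes: usize,
        kind: c_int,
    ) -> HipError;
    pub fn hipFree(ptr: *mut c_void) -> HipError;
    pub fn hipDeviceSynchronize() -> HipError;
}
```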
HIP would still leave out Intel GPUs, though (starting to become relevant now that they have discrete GPUs too!).
OpenCL would cover all vendors. Alternatively, Intel seems to be pushing oneAPI as their "native" compute thing.
The first thing I asked when I joined the dfdx discord a couple of months ago was if OpenCL support was planned. Not an OpenCL expert, but I would like to be involved in this, if I can help.
@coreylowman Yes, perhaps start with this table translating CUDA to HIP. The hipify tool is also relevant, though it processes C/C++, not Rust FFI.
@jansol The CHIP-SPV project is a HIP implementation for Intel GPUs. It's missing some edge cases but is largely functional on real hardware. Intel wants people to use SYCL (which they brand as DPC++), but it's missing some features such as run-time compilation (nvrtc/hiprtc). OpenCL still works, but vendors clearly aren't fond of it. SYCL will probably get more extensions in this area.
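For context on what run-time compilation buys here: hiprtc mirrors nvrtc nearly call-for-call, which is what makes the "compile kernel source at runtime" workflow portable to AMD. A hedged sketch of the bindings involved (the entry-point names are the real hiprtc ones, but the library name and exact signatures should be checked against hiprtc.h):

```rust
use std::os::raw::{c_char, c_int};

// hiprtcResult is a C enum; 0 == HIPRTC_SUCCESS.
pub type HiprtcResult = c_int;

// hiprtcProgram is an opaque handle in hiprtc.h.
#[repr(C)]
pub struct HiprtcProgramOpaque {
    _private: [u8; 0],
}
pub type HiprtcProgram = *mut HiprtcProgramOpaque;

#[link(name = "hiprtc")] // assumed library name; verify against the ROCm install
extern "C" {
    // Near drop-in analogues of nvrtcCreateProgram / nvrtcCompileProgram / nvrtcGetCode.
    pub fn hiprtcCreateProgram(
        prog: *mut HiprtcProgram,
        src: *const c_char,
        name: *const c_char,
        num_headers: c_int,
        headers: *const *const c_char,
        include_names: *const *const c_char,
    ) -> HiprtcResult;
    pub fn hiprtcCompileProgram(
        prog: HiprtcProgram,
        num_options: c_int,
        options: *const *const c_char,
    ) -> HiprtcResult;
    pub fn hiprtcGetCodeSize(prog: HiprtcProgram, code_size: *mut usize) -> HiprtcResult;
    pub fn hiprtcGetCode(prog: HiprtcProgram, code: *mut c_char) -> HiprtcResult;
    pub fn hiprtcDestroyProgram(prog: *mut HiprtcProgram) -> HiprtcResult;
}
```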
Considering going with a WebGPU backend (#604) instead of OpenCL, as it seems like it will cover the non-NVIDIA GPU case AND the web case. Any OpenCL people against this? Thoughts?
WebGPU should cover all current GPUs quite well. I have been warned that compute shaders still have some limitations compared to dedicated compute APIs, but for dfdx it should be possible to design around those, should they ever become an obstacle.
OpenCL would still have the benefit of several readily available implementations for non-GPU devices (CPU, FPGA, custom accelerators) and of not having to optimize for each target by hand. Essentially it comes down to whether you value that extra flexibility.
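To make the compute-shader point concrete: a simple elementwise op in WGSL looks very close to the equivalent CUDA/OpenCL kernel. This is only a sketch of what a hypothetical dfdx unary kernel could look like (the shader name, buffer layout, and workgroup size are illustrative choices, not anything dfdx ships):

```rust
/// WGSL source for an elementwise ReLU, roughly equivalent to a
/// one-thread-per-element CUDA kernel.
const RELU_WGSL: &str = r#"
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn relu_fwd(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    // Guard against the dispatch rounding the element count up to a
    // multiple of the workgroup size.
    if (i < arrayLength(&input)) {
        output[i] = max(input[i], 0.0);
    }
}
"#;
```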
> Considering going with a WebGPU backend (#604) instead of OpenCL, as it seems like it will cover the non-NVIDIA GPU case AND the web case. Any OpenCL people against this? Thoughts?
I wonder what the performance difference would be vs running natively. I haven't had much experience with WebGPU, so I have no idea. However, IMO, if the design is modular (as it should be), then there shouldn't be too much work in supporting both backends, right? It's just adding a few new objects and trait impls that people can enable via a feature flag, or that could maybe even be auto-selected.
I ask about performance simply because, while a 10% performance drop might be acceptable for some, it'd make this much less desirable than a custom OpenCL/CUDA/whatever-else implementation in some situations.
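To illustrate the "few new objects and trait impls" point, here is a hedged sketch of the feature-flag pattern being described; the trait and struct names below are made up for illustration, not dfdx's actual device API:

```rust
// Illustrative only: `Backend`, `Cpu`, `Webgpu`, and `AutoBackend` are
// hypothetical names, not dfdx's real traits/types.
pub trait Backend {
    fn alloc_f32(&self, len: usize) -> Vec<f32>;
}

pub struct Cpu;
impl Backend for Cpu {
    fn alloc_f32(&self, len: usize) -> Vec<f32> {
        vec![0.0; len]
    }
}

#[cfg(feature = "webgpu")]
pub struct Webgpu; // would hold the GPU device/queue handles

#[cfg(feature = "webgpu")]
impl Backend for Webgpu {
    fn alloc_f32(&self, _len: usize) -> Vec<f32> {
        todo!("allocate and map a GPU buffer")
    }
}

// "Auto-selected" default backend, picked by whichever feature is enabled.
#[cfg(feature = "webgpu")]
pub type AutoBackend = Webgpu;
#[cfg(not(feature = "webgpu"))]
pub type AutoBackend = Cpu;
```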
> However, IMO, if the design is modular (as it should be), then there shouldn't be too much work in supporting both backends, right?
It just takes away time from doing other things - there are currently ~44 (and counting) tensor ops that require writing kernels. A large number of those are pretty simple (like unary operations), but things like reduction/conv/matmul kernels are more complicated. If we had a universal language for specifying kernels that covers CUDA/OpenCL/WebGPU, then I think it'd be a different story.
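For a sense of that per-op surface area: each tensor op has a kernel trait that every device must implement, so a new backend means ~44 of these. A simplified sketch of the pattern (the real dfdx traits carry shape, stride, dtype, and error information that is omitted here):

```rust
// Simplified illustration of the per-op kernel pattern; not dfdx's real trait.
pub trait UnaryKernel<Op> {
    fn forward(&self, op: Op, inp: &[f32], out: &mut [f32]);
    fn backward(&self, op: Op, inp: &[f32], grad_inp: &mut [f32], grad_out: &[f32]);
}

pub struct ReLUKernelOp;
pub struct OpenCl; // hypothetical new device

impl UnaryKernel<ReLUKernelOp> for OpenCl {
    fn forward(&self, _op: ReLUKernelOp, _inp: &[f32], _out: &mut [f32]) {
        todo!("enqueue the relu forward kernel")
    }
    fn backward(
        &self,
        _op: ReLUKernelOp,
        _inp: &[f32],
        _grad_inp: &mut [f32],
        _grad_out: &[f32],
    ) {
        todo!("enqueue the relu backward kernel")
    }
}
```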
As brought up in the reddit thread, an OpenCL device would be useful to support folks with AMD GPUs.
Here are roughly the tasks that need to happen (a rough sketch of the resulting skeleton follows the list):
1. Add an `opencl` feature flag
2. Add an `OpenCL` device struct under `src/tensor/opencl/device.rs` & add `impl DeviceStorage` for `OpenCL` in `src/tensor/opencl/allocate.rs`. See `src/tensor/cuda/allocate.rs` for examples.
3. Add `OpenCL` under tensor_ops - tests should compile & run but fail (add `opencl.rs` to all kernel folders, with `todo!()` in every kernel method).
4. Add `impl Device<E>` for `OpenCL` to `src/tensor_ops/utilities/device.rs`.
5. Add `OpenCL` kernels - this is highly parallelizable, each tensor op can be worked on separately. Every tensor op with a folder will need a kernel.
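A rough sketch of the skeleton the first couple of tasks describe. The module paths follow the list above, but the struct fields and the `DeviceStorage` shape shown here are guesses for illustration, not the actual trait from the repo:

```rust
// src/tensor/opencl/device.rs (sketch)
// Field types are placeholders; a real impl would wrap an OpenCL
// context, queue, and allocator from whichever OpenCL crate is chosen.
#[derive(Clone)]
pub struct OpenCL {
    // context: ...,
    // queue: ...,
}

// src/tensor/opencl/allocate.rs (sketch)
// Mirrors the structure of src/tensor/cuda/allocate.rs; the real
// DeviceStorage trait has more associated types and error handling.
pub trait DeviceStorage {
    type Vec;
    fn try_alloc_len(&self, len: usize) -> Result<Self::Vec, String>;
}

impl DeviceStorage for OpenCL {
    type Vec = Vec<f32>; // placeholder for a cl_mem-backed buffer
    fn try_alloc_len(&self, _len: usize) -> Result<Self::Vec, String> {
        todo!("allocate an OpenCL buffer")
    }
}
```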