coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Add OpenCL device #597

Open coreylowman opened 1 year ago

coreylowman commented 1 year ago

As brought up in the reddit thread, an OpenCL device would be useful to support folks with AMD GPUs.

Here are roughly the tasks that need to happen:

  1. [ ] Add opencl feature flag
  2. [ ] Create an OpenCL device struct under src/tensor/opencl/device.rs & add impl DeviceStorage for OpenCL (see the sketch after this list)
  3. [ ] Implement tensor allocation methods under src/tensor/opencl/allocate.rs. See src/tensor/cuda/allocate.rs for examples
    1. [ ] impl ZerosTensor
    2. [ ] impl OnesTensor
    3. [ ] impl OneFillStorage
    4. [ ] impl SampleTensor
    5. [ ] impl CopySlice
    6. [ ] impl TensorFromVec
    7. [ ] impl TensorToArray
  4. [ ] Implement skeleton kernels for OpenCL under tensor_ops - tests should compile & run but fail
    1. [ ] Add module opencl.rs to all kernel folders
    2. [ ] Add a skeleton impl for all kernels (put todo!() in every kernel method)
    3. [ ] Add impl Device<E> for OpenCL to src/tensor_ops/utilities/device.rs
  5. [ ] Actually implement the OpenCL kernels - this is highly parallelizable; each tensor op can be worked on separately. Every tensor op with a folder will need a kernel.
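
To make items 2 and 3.1 a bit more concrete, here's a minimal sketch of what the device struct and a zero-fill allocation could look like, assuming the `ocl` crate for the OpenCL bindings. The struct layout and method names here are illustrative only; the real `DeviceStorage`/`ZerosTensor` trait signatures in dfdx will differ.

```rust
use std::sync::Arc;
use ocl::{Buffer, Context, Device, Platform, Queue};

/// Sketch of an OpenCL device: a shared context + command queue,
/// analogous to how the CUDA device wraps a device handle + stream.
#[derive(Clone, Debug)]
pub struct OpenCl {
    context: Arc<Context>,
    queue: Arc<Queue>,
}

impl OpenCl {
    pub fn try_default() -> ocl::Result<Self> {
        let platform = Platform::default();
        let device = Device::first(platform)?;
        let context = Context::builder()
            .platform(platform)
            .devices(device)
            .build()?;
        let queue = Queue::new(&context, device, None)?;
        Ok(Self {
            context: Arc::new(context),
            queue: Arc::new(queue),
        })
    }

    /// Roughly what a `ZerosTensor`-style allocation would boil down to:
    /// a zero-filled device buffer of `len` elements.
    pub fn try_alloc_zeros_f32(&self, len: usize) -> ocl::Result<Buffer<f32>> {
        Buffer::<f32>::builder()
            .queue((*self.queue).clone())
            .len(len)
            .fill_val(0.0f32)
            .build()
    }
}
```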
jedbrown commented 1 year ago

An alternative (lower effort and better supported by AMD) would be to provide HIP support. The interface is nearly a perfect 1:1 match with CUDA, including hipRTC.

Nabushika commented 1 year ago

I've got some experience with OpenCL, especially in the context of generating kernels from compile-time representations of expressions in C++. This sounds fun and I'd love to help! :)

coreylowman commented 1 year ago

> An alternative (lower effort and better supported by AMD) would be to provide HIP support. The interface is nearly a perfect 1:1 match with CUDA, including hipRTC.

Is that this? https://github.com/ROCm-Developer-Tools/HIP

Do you know if there's a C API that we can create an FFI for?

jansol commented 1 year ago

HIP would still leave out Intel GPUs, though (which are starting to become relevant now that Intel has discrete GPUs too!).

OpenCL would cover all vendors. Alternatively, Intel seems to be pushing oneAPI as its "native" compute stack.

kstavro commented 1 year ago

The first thing I asked when I joined the dfdx Discord a couple of months ago was whether OpenCL support was planned. I'm not an OpenCL expert, but I'd like to be involved in this if I can help.

jedbrown commented 1 year ago

@coreylowman Yes, perhaps start with this table translating CUDA to HIP. The hipify tool is also relevant, though it processes C/C++, not Rust FFI.

@jansol The CHIP-SPV project is a HIP implementation for Intel GPUs. It's missing some edge cases, but it's largely functional on real hardware. Intel wants people to use SYCL (which they brand as DPC++), but it's missing some features like run-time compilation (nvrtc/hiprtc). OpenCL still works, but vendors clearly aren't fond of it. SYCL will probably get more extensions in this area.

coreylowman commented 1 year ago

Considering going with a WebGPU backend (#604) instead of OpenCL, as it seems like it would cover the non-NVIDIA GPU case AND the web case. Any OpenCL people against this? Thoughts?

jansol commented 1 year ago

WebGPU should cover all current GPUs quite well. I have been warned that compute shaders do still have some limitations compared to dedicated compute APIs, but for dfdx it should be possible to design around those, should they ever become an obstacle.

OpenCL would still have the benefit of having several readily available implementations for non-GPU devices (CPU, FPGA, custom accelerators) and not having to optimize for each target by hand. Essentially it comes down to whether you value that extra flexibility.

Nabushika commented 1 year ago

> Considering going with a WebGPU backend (#604) instead of OpenCL, as it seems like it would cover the non-NVIDIA GPU case AND the web case. Any OpenCL people against this? Thoughts?

I wonder what the performance difference would be vs. running natively. I haven't had much experience with WebGPU, so I have no idea. However, IMO if the design is modular (as it should be), there shouldn't be too much work in supporting both backends, right? It's just adding a few new objects and trait impls that people can enable via a feature flag, or that could maybe even be auto-selected.

I ask about performance simply because, while a 10% performance drop might be acceptable for some, it would make this much less desirable than a custom OpenCL/CUDA/whatever-else implementation in some situations.

coreylowman commented 1 year ago

> However, IMO if the design is modular (as it should be), there shouldn't be too much work in supporting both backends, right?

It just takes away time from doing other things - there are currently ~44 (and counting) tensor ops that require writing kernels. A large number of those are pretty simple (like unary operations, sketched below), but things like reduction/conv/matmul kernels are more complicated. If we had a universal language for specifying kernels that covers CUDA/OpenCL/WebGPU, then I think it'd be a different story.
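
For a sense of what one of the simpler ones involves, here is a hedged sketch of a unary (sin) forward kernel compiled at run time with the `ocl` crate, roughly analogous to how the existing CUDA kernels use nvrtc. The kernel name, the standalone function shape, and the error handling are assumptions for illustration, not how it would actually hook into dfdx's kernel traits.

```rust
/// Apply `sin` elementwise on the GPU and read the result back to the host.
fn sin_forward(xs: &[f32]) -> ocl::Result<Vec<f32>> {
    // OpenCL C source compiled at run time, like the rtc-based CUDA kernels.
    const SRC: &str = r#"
        __kernel void sin_fwd(__global const float* inp, __global float* out) {
            size_t i = get_global_id(0);
            out[i] = sin(inp[i]);
        }
    "#;

    // Build program + queue, with one work item per element.
    let pro_que = ocl::ProQue::builder().src(SRC).dims(xs.len()).build()?;

    // Copy the input to the device and allocate an output buffer.
    let inp = pro_que.buffer_builder::<f32>().copy_host_slice(xs).build()?;
    let out = pro_que.create_buffer::<f32>()?;

    // Bind arguments and launch.
    let kernel = pro_que
        .kernel_builder("sin_fwd")
        .arg(&inp)
        .arg(&out)
        .build()?;
    unsafe {
        kernel.enq()?;
    }

    // Blocking read of the result back to host memory.
    let mut result = vec![0.0f32; xs.len()];
    out.read(&mut result).enq()?;
    Ok(result)
}
```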