GPU Mega Issue - Githubissues

coreylowman commented 2 years ago

There's a lot of work to be done here. Very rough list of todos:

Preparation
- [x] Move map functions to devices #199
- [x] Move conv to devices #198
- [x] Add where clauses for map functions to make partial progress on kernels possible (so we can start using cuda without all ops implemented)
Devices
- [ ] Add Cuda device that wraps cudarc::CudaDevice and an rng
- [x] Add StdRng to Cpu
- [x] Add rng seed to device construction
- [ ] ~~Add two GATs to device trait: DeviceArc and DeviceRng~~
  - [ ] ~~Add CpuRc which contains Arc<T> and Arc<Cpu>~~
Tensors
- [x] Add Device to all tensor structs
- [x] TensorCreator should accept &Device as parameter, and remove Rng since that will be accessed through device
- [x] Move Device to generic argument of Tensors
- [ ] ~~Enable moving tensors between devices~~
nn
- [x] Add trait ModuleCreator
  - [x] Add ModuleCreator::zeros(Device)
  - [x] Add ModuleCreator::default(Device) which calls zeros & reset params
- [x] Remove implementations for Default
- [x] Remove rng parameter from ResetParams, should use tensor's devices
Kernels
- [x] Add trait LaunchKernel<K, Args>
- [x] Move all Cpu traits to a combo of impl LaunchKernel<...> for Cpu and trait <Kernel>CpuImpl/impl <Kernel>CpuImpl for <Kernel>. See cudarc/examples/kernels.rs
- [ ] (In a separate crate) proc macro that wraps around kernels and maps them to something usable for ptx compiling (e.g. kernel!(|a, b, c| { *a = b + c }) (#185)
- [ ] Look into when/how to build the kernels (compile time hopefully??) (#184)
Testing
- [x] Add feature based device construction in all tests (something like #[cfg(feature="test-cuda"]) that when specified uses cuda instead of cpu?
- [x] Add macro build_test_device!() to use that uses testing features to create the device

Done:

[x] Is it even possible to compile a rust closure to a cuda kernel? Assuming very small set of supported operations. Is this worth the maintainability?
- [x] If we go the fixed set of functions route, how many different generic closures does dfdx use currently?
- ANSWER: Yes it is possible (the rust cuda project does it), but it will take some work. Automatic closure conversion to kernel is probably the direction i'll be trying to go since hand building all the cuda kernels next to the cpu closures seems too much work.
[x] What functionality does nvidia provide for deep learning already? Assuming matmul & conv forward/backward. How to use these?
- ANSWER: cudnn, all tensors are 4d, supports base set of operations. probably not what we want to depend on tbh since it doesn't support everything we would need on GPU (e.g. optimizer kernels)

coreylowman commented 2 years ago

This'll likely involved another generic parameter for Tensors. Maybe a trait Device that supports base operations? Will need to look into crates that do cuda operations...

coreylowman commented 2 years ago

There is now a Cpu device, and a Device trait. As of now all device methods are static, but this could be changed for GPUs. If GPU device idx can be stored as a const usize then it wouldn't have to change.

coreylowman commented 2 years ago

See crates: https://crates.io/crates/rustacuda, https://crates.io/crates/cuda-driver-sys, https://crates.io/crates/cuda-runtime-sys

coreylowman commented 2 years ago

Another crate to look at: https://crates.io/crates/gpu-alloc

jafioti commented 2 years ago

Another more pure rust project: https://github.com/Rust-GPU/Rust-CUDA

nielsle commented 2 years ago

You may be able to use https://github.com/arrayfire/arrayfire-rust

Timmmm commented 2 years ago

I wrote a simple FTDT test with Arrayfire-rust and it was very slow. Maybe I was doing something wrong but I think it probably makes some significant performance sacrifices to make the API easier (it is very easy). I probably wouldn't recommend that route.

Btw I guess the way to do this would be to basically allow the compute graph to be compiled to ONNX, then you can use the existing GPU implementations of all of its operators.

You need to decide whether it's going to support eager execution or fully precompiled modes, or both. Looks like at the moment it works only in eagar execution mode (operations are done as you come to them). Accelerators like GPUs (and all of the new AI accelerators) tend to work much much better with fully precompiled models, where you basically feed it a big compute graph (like an ONNX model) and then it compiles the whole thing and optimises it all at once.

Eagar mode is nicer for developers because you can do control flow on the CPU and you can use a REPL (more relevant for Python than Rust). That's partly why Pytorch (eagar by default) became more popular than Tensorflow (compiled by default) and Tensorflow eventually had to implement eagar execution. Downside of eagar execution is it makes it harder to export a model, e.g. to ONNX. Pytorch does it by tracing execution, which has big caveats (e.g. control flow path can't change).

Ok sorry if you knew all that already. Nice looking library; good luck! I hope we can expunge Python from AI one day.

jafioti commented 2 years ago

@Timmmm Thanks for the little write up! I think in the short term it might be easier to just have CUDA implementations of each operator, but eventually compiling whole models sounds great. It would be even better if we could surround a module with dfdx::compile() or something so that you can compile parts of the model that are known to work well, while keeping other parts in eager mode that are less stable or have changing control flow paths.

Again I think the best route for the short term implementation of operations in CUDA would be rust-cuda. I got a simple project using cust to work (that's one of their crates) where it can run a precompiled kernel from a .ptx file on data from the rust program. The problem I ran in to was compiling new kernels. You need a very specific version of CUDA and LLVM (7.0-7.4) which is very difficult to install on modern Ubuntu (at least for me), so I couldn't get it to work.

Another option I was looking at was to use Triton (https://openai.com/blog/triton/), which seems like a good option to write kernels in a python DSL, have it optimize the kernels using its own compiler, and then bring the .ptx files over. Obviously a more pure-rust solution is better, but until rust-cuda gets updated to a newer LLVM, we might be stuck with this.

coreylowman commented 2 years ago

Yeah appreciate those details @Timmmm, that's all great to know! Agree with Joe that we'd probably

@jafioti do you happen to know why you need those versions of CUDA & LLVM? that seems like a pretty big downside. Using something non-rust to compile kernels seems fine for a first pass to me. Was using the compiled ptx from rust relatively easy?

jafioti commented 2 years ago

@coreylowman I don't, but there was an explanation given on reddit when the project was announced: https://www.reddit.com/r/rust/comments/qzv428/comment/hlzkgmf/?utm_source=share&utm_medium=web2x&context=3 It says it also on the getting started page.

As far as just using the precompiled kernel, it was really easy. I could send my project later tonight, it was just a matter of storing the ptx file in a folder, pointing cust at it, copying some buffers, and launching it.

jafioti commented 2 years ago

Here's my project running precompiled CUDA kernels with pure rust: https://cdn.discordapp.com/attachments/998570812590260336/998790303320379482/cuda_runner.zip

coreylowman commented 2 years ago

Another interesting (not updated in 7 years) crate to check out: https://github.com/autumnai/collenchyma, which supports https://github.com/autumnai/leaf

AntonioLucibello commented 2 years ago

Would it be feasible to use rust-gpu or compute shaders (eg wgpu+wgsl/ piet-gpu-hal) as an alternative to CUDA?

Of course CUDA is by far the least risky and probably most performant approach, but it restricts gpu acceleration to Nvidia hardware only. In order to support other vendors you'd then have to duplicate the gpu interfacing code to make use of other APIs, which would introduce non-rust dependencies, which are a pain to work with afaict. Even then, there's the matter of which APIs to target: ROCm is not supported on older amd gpus (and not officially supported on current gen consumer grade gpus afaik), and leaves out intel and apple support. OpenCL support is generally a mess, and oneAPI is being pushed by Intel but may very well become yet another abandoned standard. I think that a deep learning framework that Just Works™ on any gpu without any weird compatibility hacks is something the ML ecosystem as a whole is currently missing, and desperately needs.

rust-gpu would allow the gpu interfacing code to be written in rust (and eventually allow a library user to write custom lambdas in rust?), but its downsides are that it requires nightly and it's a relatively young project, so it's not very battle tested. The wgpu+wgsl approach has the advantage of not requiring the user to install any proprietary toolkit such as CUDA or ROCm on their machine. The downside is that gpu interfacing code would need to be written in wgsl.

Both these approaches rely on wgpu, and benefit from its wide compatibility. Using wgpu may also allow the library to natively access the gpu in a browser setting.

jafioti commented 2 years ago

@AntonioLucibello Thanks for the writeup! I'll add some of my thoughts: I do think it would be awesome to have a Just Works library that runs on each platform, and I would be interested in taking a look at rust-gpu. When I last looked, they werre VERY unstable and too early to feasibly use. If you can whip up an example with rust-gpu that would be awesome.

I also am worried about the performance tradeoffs being made if we decided to forgo CUDA. Right now, almost no one trains on non-nvidia gpus. One reason certianly is CUDA, but I also think AMD and Intel don't really focus on ML right now, so they just aren't feasible competitors for most practitioners and companies. And if we go a compute-shader route,. we would give up a certain amount of perf that may be critical for most ppl, and so convincing people to use this over others may prove difficult.

To be clear, I have no hard solution to this, jus throwing my thoughts out there. Remember we're trying to mostly do training in this library (hence the focus on safety when building models), there's lots of great rust based ONNX inference libs out there.

Also one thing I forgot to mention, if we go the cura route we can use existing kernels from PyTorch (with attribution of course) and triton to write new kernels.

AntonioLucibello commented 2 years ago

After some research I came across this: https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/

From what I can gather, Nvidia added support for hardware accelerated matrix operations in vulkan compute shaders. Given that Nvidia themselves made a blog post marketing this feature, the perf hit from compute shaders may be less severe than expected. Of course, without benchmarks this is just speculation. So far this feature seems to only be accessible from glsl as an extension, so until it gets added to wgsl, the wgsl+wgpu route isn't really as viable. ~Currently rust-gpu doesn't have any way to access it either afaict (as the project as a whole seems more targeted towards the graphics side of things rather than compute), but support for it could potentially be added, since SPIR-V itself supports the cooperative matrix extension.~

~As of now, the most reasonable way to go about the compute shader route would be to write shaders in glsl, compiling them to SPIR-V through naga, and running them as compute shaders with wgpu. If and when support for this extension were to be added to wgsl or rust-gpu, they should be able to replace the glsl+naga part of the setup without much change to the preexisting rust code.~

Edit: rust-gpu may be able to access this feature through inline assembly, but I have no way to test this as my machine's gpu doesn't have tensor cores.

jafioti commented 2 years ago

@AntonioLucibello I'm not sure I understand what you mean by operations vs entire model. As far as I understand, each pytorch operation is on the GPU as a CUDA kernel. These kernels can take in a tensor, where one dimension is a batch size (data parallelism). Pipeline parallelism only comes into play when you have multiple GPUs, distributed training is a whole other bag of cats entirely separate from this discussion.

I suppose in the end it would be great if we could fuse operations together at compile time all the way to making the model a single kernel, but I''m not sure if that would work.

coreylowman commented 2 years ago

@AntonioLucibello totally agree about something that just works. Agree with jafioti that supporting cuda is a necessity, but I was thinking the other day that it'd be nice to support a bunch of options. OFC every option is going to take a lot of effort, so I think we just need to start with cuda and go from there.

coreylowman commented 2 years ago

Also I'm working on a prototype using cust (the base crate of rust-cuda), will hopefully share in the next couple weeks. There's going to be a TON of work involved with all this.

coreylowman commented 2 years ago

For future reference:

tensorflow's .cu kernels are in this directory: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels
pytorch's kernels are in this directory: https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native/cuda

coreylowman commented 2 years ago

I spent a bit of time trying to prototype something using cust (from Rust-CUDA project), however ran into a number of difficulties. Notably they don't have nvrtc, which is necessary for jit compiling cuda kernels. Also unclear how active the repos are.

All that said, I've been working on a new crate cudarc that provides some safe cuda wrappers to use for dfdx. You can see some sketches of how I'm planning on integrating in cudarc/examples. Still need to add curand and cublas bindings/support to cudarc, but after that it will be in a good place to start integrating into dfdx. Then a large majority of work will be writing the cuda kernels.

XBagon commented 2 years ago

Maybe support something like onnx could be an simple(?), more flexible way of giving dfdx GPU access in some kind. While also opening up things like optimizations and interaction with other ML tools.

coreylowman commented 2 years ago

Yeah there's a separate issue for onnx support #57, and I think there's actually a rust onnx wrapper (at least for inference). For now I'm going to continue along the cuda device path since I do see that as a must have eventually. I like the idea of adding an Onnx device in the future (similar to adding OpenCL device as mentioned above).

vultix commented 2 years ago

I know it's likely much lower priority, but I'd love to see support for Apple Silicon GPUs as well. Perhaps something like wgpu compute shaders could be a relatively easy way to support m1 macs without needing to dive into the metal api?

vikigenius commented 2 years ago

Just an update in Triton land they released a inference engine library https://github.com/ELS-RD/kernl/ making use of Triton Kernel, and their benchmarks look really impressive.

If Rust Cuda is proving too painful we can look into TRITON as a possible short term solution like @jafioti till the CUDA ecosystem develops further

jafioti commented 2 years ago

@vikigenius I'm a little confused by kernl, it seems like its just a collection of kernels for running popular architectures. So the transformer is built from kernels written in Triton, which gives it its speed.

Maybe I'm confused, but it seems like we can just take those kernels, or generate new ones through triton, and run them through Rust Cuda, which should treat then as any other kernel.

vikigenius commented 2 years ago

Yeah, I took some time to look into the repository. It's just a collection of kernels. For us it's a basically a proof of concept Re triton. It looks much simpler than typical cuda kernels even though I have never worked with Triton before.

So we could either take those kernels directly or just generate ones we need using Triton. Until we have the machinery and abstractions to write Kernels in pure Rust ourselves.

kurnevsky commented 2 years ago

There is an existing library that went rust-gpu path, maybe you will find it useful: https://github.com/charles-r-earp/autograph

M1ngXU commented 1 year ago

why not mix cudnn with custom cuda kernels? tensors in cudnn are a descriptor and the data; the data being just an allocation like a cuda slice (different representations, but usually NCHW)

coreylowman commented 1 year ago

Yeah let's do that for specific kernels, it seems like for now we can use

cublas for matmul (already in cudarc)

cudnn for:

sum/max/min reductions
conv2d
min/max/avg pool2d
maybe dropout (the reserve space stuff seems slightly more complicated than just doing it ourselves. can revisit in future)

custom kernels for everything else

M1ngXU commented 1 year ago

wdym with reserve space stuff? don't we need that too to know what has been dropped out?

coreylowman commented 1 year ago

It looks easier to just allocate a CudaSlice ourselves to hold the noise and write a custom kernel than to add all the required cudnn stuff.

coreylowman commented 1 year ago

Okay as of now, all the tests on main pass for both Cuda device and Cpu, so I'm closing this as complete! 🎉

There are still other issues/conveniences when working with cuda/multiple devices that need to be done, but as of now I consider GPU support discussed in this ticket done.

Thanks to the contributions of @nkoppel, @M1ngXU, and @ViliamVadocz!

🚀 🔥 🚀 🔥 🚀 🔥 🚀 🔥 🚀 🔥 🚀 🔥

coreylowman / dfdx

GPU Mega Issue #9