Mikolaj / horde-ad

Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation"

[Feature] I can write code with large tensors, and the derivative runs as fast as PyTorch on GPU #43

Open Mikolaj opened 2 years ago

Mikolaj commented 2 years ago

Desiderata:

Supposedly, this can be done rather cheaply by offloading most of the work to LLVM and the Haskell packages that interface with it.

My loose notes from a chat with Alp Mestanogullari:

you can always write instances for ASTs and JIT-compile algorithms built on top of your differentiation mechanics

https://github.com/llvm-hs/llvm-hs-examples/blob/master/arith/Arith.hs

it's essentially scalar functions of just one (scalar) variable

and you can keep in mind that this could just be taken down to the GPU path instead with llvm-ptx

the above example should look fairly simple to you, despite your lack of familiarity with llvm-hs

https://github.com/llvm-hs/llvm-hs-examples/blob/master/arith/Arith.hs#L266 <- this is me building the LLVM AST/module for my little expression type.

the rest is just plumbing.

so yeah, if you want to take performance up a notch, I'd explore going down that route. You can still use/refer to external functions (BLAS and all), as long as you link to them, but this gives you a chance to optimize your expressions and emit optimal machine code for your computations, like a bunch of modern stuff does

and imagine staging this whole process (i.e., doing some TH to run the codegen and link the machine code in with your Haskell executable) so that none of that compilation happens at runtime... pretty ideal world here.

https://github.com/google-research/dex-lang#dex- <- the paper for this was very interesting btw, if you like thinking about ASTs/DSLs/compilation/numerics/AD/etc
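
As a concrete sketch of the "instances for ASTs" idea (illustrative only; not horde-ad's or llvm-hs's actual API): any numeric code polymorphic in its scalar type can be reified into an expression AST, which can then be handed to a JIT backend such as llvm-hs.

```haskell
-- Illustrative sketch only: a tiny expression AST with a Num instance,
-- so code written against Num can be reified and later JIT-compiled
-- (e.g., lowered to LLVM IR, as in the Arith.hs example above).
data Expr
  = Var                 -- the single scalar variable
  | Lit Double
  | Add Expr Expr
  | Mul Expr Expr
  deriving Show

instance Num Expr where
  (+) = Add
  (*) = Mul
  negate e    = Lit (-1) `Mul` e
  fromInteger = Lit . fromInteger
  abs         = error "abs: not needed for this sketch"
  signum      = error "signum: not needed for this sketch"

-- Any function polymorphic in Num reifies to an AST when applied to Var:
f :: Num a => a -> a
f x = x * x + 2 * x + 1

reified :: Expr
reified = f Var
-- Add (Add (Mul Var Var) (Mul (Lit 2.0) Var)) (Lit 1.0)
```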

Mikolaj commented 2 years ago

Related transcript from datahaskell/lobby (https://matrix.to/#/!JSHTXUPHaIJhjEINXX:matrix.org?via=m.topoi.dev&via=gitter.im&via=matrix.org) follows:

anton_5 Man of Letters: you may also want to try ArrayFire https://github.com/arrayfire/arrayfire-haskell instead of hmatrix for GPUs.

Staging would be really nice to have, but my understanding is that Haskell doesn't really have good support for staged metaprogramming right now. Anyway, I think that using ArrayFire would be a much quicker way to get your lib working with GPUs, so it could be a good starting point. Also, what is nice about ArrayFire is that at some point I managed to run it even on Intel's built-in GPU (in my laptop), so it's not limited only to Nvidia cards.

https://iagoleal.com/posts/calculus-symbolic-ode/

if going via the staging route, I think it’s worth exploring the possibility of emitting MLIR https://mlir.llvm.org/. I have just found MLIR Haskell bindings! :) https://github.com/google/mlir-hs.

Mikolaj commented 1 year ago

https://github.com/arrayfire/arrayfire-haskell looks great, in particular because it implements the same vast set of operations on CPU (using various extensions) and GPU, transparently.

The convolution operations are described in the language of signal processing, and it's not obvious whether they are performant enough for ML (e.g., convolution is said to flip a dimension, which ML doesn't need, so in ML correlation is in fact used instead, but correlation is not available here).

Functions are not shape-typed (and indeed they dynamically produce arrays of different sizes depending on complex conditions on the arguments, even if theoretically these could be derived statically). Arrays have at most four dimensions. Unlike hmatrix, it doesn't use storable vectors as the underlying representation, but requires a costly conversion (allocating a copy of the whole array) even for read-only use. I don't know whether the memory representation is actually different (I'd guess not) or whether that's just a design decision of these particular FFI bindings, cheaply ensuring that arrays are always copied, even if they are used read-only. Or perhaps the internal representation differs for each backend?

I think we could use this library instead of hmatrix, but we'd need to benchmark the two to make sure CPU performance is not much worse. Ideally, performance would remain acceptable even if we insist on converting to storable vectors on each gradient descent iteration. An option is to add more ranks (arrayfire untyped and arrayfire typed) and extend the optimizers to cope with the new representations.

I wonder, are 4 dimensions enough?

Mikolaj commented 1 year ago

Actually, the problem with the costly conversion to storable vectors is bigger than a once-per-gradient-descent-iteration slowdown. Every numerical operation on orthotope tensors is implemented by converting to hmatrix and back. That conversion is cheap, but only for vectors and matrices; hmatrix doesn't support arrays of rank 3 or 4, so those are implemented by iterating over a vector of matrices (there ArrayFire would have an advantage). Consequently, each numerical operation would bear the extra cost. This means we may need to implement a version of orthotope backed by ArrayFire arrays instead of storable vectors, in addition to extending/changing our set of implemented ranks.
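
For illustration, the round-trip pattern looks roughly like this (a sketch of the pattern under assumed imports, not horde-ad's actual code):

```haskell
import qualified Data.Array.RankedS as OR      -- orthotope, storable-backed
import qualified Numeric.LinearAlgebra as LA   -- hmatrix

-- Sketch of the round trip: convert the orthotope array to an hmatrix
-- matrix, multiply, convert back. Cheap for ranks 1 and 2; higher ranks
-- need iteration over a vector of matrices, which ArrayFire would avoid.
matmul :: OR.Array 2 Double -> OR.Array 2 Double -> OR.Array 2 Double
matmul a b =
  let [ar, ac] = OR.shapeL a
      [_,  bc] = OR.shapeL b
      am = LA.reshape ac (OR.toVector a)   -- copies at worst
      bm = LA.reshape bc (OR.toVector b)
  in OR.fromVector [ar, bc] (LA.flatten (am LA.<> bm))
```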

I'm not so keen on this, especially since the Haskell bindings to ArrayFire have had one or two commits since 2019. Is the C API so stable? Is that because the C API is so comprehensive and well done? Or are the Haskell bindings behind the C API, and is that why correlation is not available and only the signal-processing-style convolution is? Checking which GHC it works with and how fast it can be ported would be a good indication of the maintenance status of the Haskell bindings. Or, a more direct route may be asking the kind maintainer @dmjio. :)

Now that I think about it, different array representations for different backends are necessary if the data is to stay on the GPU in between individual arithmetic operations. Sending data back to the CPU (so that the Storable Vector pointer can be created) after each matmul operation would be a performance disaster. To fake the Storable Vector representation at no cost, we'd need the transfer to the CPU to be lazy, with a rewrite rule so that, if the data is sent back to the GPU unchanged, no transfer is performed at all. But it's IMHO more natural to just reimplement with ArrayFire the few orthotope operations that touch the underlying vector (by design, most orthotope operations only tweak the dimensions list).
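
A sketch of what that abstraction could look like (hypothetical names, not orthotope's actual code): the buffer goes behind a typeclass, shape-only operations stay generic, and only the element-touching operations need a per-backend method.

```haskell
{-# LANGUAGE TypeFamilies #-}
import qualified Data.Vector.Storable as VS

-- Hypothetical abstraction over the underlying buffer. Shape-only
-- operations (transpose, reshape, stride tricks) never touch it;
-- only element-touching operations need these methods, and an
-- ArrayFire-backed instance could keep the data on the GPU.
class TensorBuffer v where
  type Elt v
  bufIndex :: v -> Int -> Elt v
  bufMap   :: (Elt v -> Elt v) -> v -> v

-- The current behaviour: storable vectors in RAM.
instance VS.Storable a => TensorBuffer (VS.Vector a) where
  type Elt (VS.Vector a) = a
  bufIndex = (VS.!)
  bufMap   = VS.map

-- An orthotope-style array parameterized by its buffer type.
data Tensor v = Tensor { tensorShape :: [Int], tensorBuffer :: v }
```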

dmjio commented 1 year ago

Hello @Mikolaj 👋🏼 😄

Regarding the Haskell ArrayFire bindings, since reading your comment I've updated the bindings to support the latest arrayfire version 3.8.2, and have tested them with GHC 9.4.2. More info on that here.

In general, the arrayfire C API seems stable. New functions are added, but existing functions do not seem to have breaking interface changes. There are three levels of bindings: the low-level bindings are generated automatically by parsing the include/af/*.h files; the mid- and high-level bindings are written manually, with the help of some utility functions.

Regarding the convolve function (among others) that I believe you're referring to, I will regenerate the low-level bindings to incorporate these new functions and make a 7.0.1.0 release. In the meantime, the latest version of arrayfire, v3.8.2, should be usable from the current Haskell bindings.

The nix build isn't well supported any longer, but installing arrayfire with your package manager of choice on Linux / Darwin (and building with cabal or stack) seems to work well. Documentation is uploaded manually since the hackage build server doesn't have arrayfire installed (to my knowledge). Very few dependencies are required so it works pretty well across versions.

More than happy to continue to support arrayfire-haskell, and collaborate on high-level binding features (time allowing) if you choose to use it with horde-ad 🍻 .

Mikolaj commented 1 year ago

@dmjio, this is amazing news. If we add Haskell ArrayFire to the pool of our backends, I will certainly badger you with requests and, let's hope, also patches. And I'm already grateful that you grant us the option.

BTW, do you happen to know why the underlying data container in Haskell ArrayFire is an abstract pointer, instead of Storable Vector (with a primitive Haskell Storable array under the hood), as in hmatrix or orthotope and some others? Also, any long term plans for a dependently typed sized (shaped) very high level API, similar to orthotope?

dmjio commented 1 year ago

BTW, do you happen to know why the underlying data container in Haskell ArrayFire is an abstract pointer, instead of Storable Vector (with a primitive Haskell Storable array under the hood), as in hmatrix or orthotope and some others?

My understanding is that the operations exposed via the C API first construct an AST that gets JIT-compiled into code that minimizes the number of GPU round trips and memory allocations. So one perspective is that the ArrayFire API allows you to construct thunks that are lazily, and efficiently, evaluated on the target backend (GPU or CPU).

@9prady9 and the team can go into more details I'm sure.

These resources helped me understand more of the lazy evaluation infrastructure

But I'm pretty sure this is the reason for the abstract type.

Also, any long term plans for a dependently typed sized (shaped) very high level API, similar to orthotope?

Definitely open to it. I know hmatrix has a typed API; in practice, teams I've been on didn't use it much, but I'm open to adding one to arrayfire because I do think it can really help in certain circumstances.

Mikolaj commented 1 year ago

I gathered some more feedback on ML for GPU. Staging (JIT) to MLIR comes up repeatedly, due to how costly GPU-CPU communication is and how cool MLIR is. However, for a start, I'd prefer not to handle staging myself but to outsource it. It seems ArrayFire does just that, though probably not with MLIR (it predates MLIR?). So let's try out ArrayFire and benchmark it against hmatrix: #71.

@dmjio, thank you for the explanation and the links. So it seems the conversion of the abstract pointer to a concrete RAM array is precisely the cue for ArrayFire to JIT and compute. If so, it may be reasonable (and perhaps optimal) to JIT once per total gradient calculation in the gradient descent loop. More often than that increases the communication overhead; less often keeps too much intermediate data in memory. In other words, converting the ArrayFire pointer to a RAM array whenever the main gradient descent state is updated, which I worried about previously, may indeed incur an unnecessary transfer from the GPU, but at least it prompts the JIT at the right moment.
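
Roughly the shape of the loop I have in mind (a sketch with hypothetical stand-ins: in reality `Lazy` would be an ArrayFire array, i.e., a graph node, and `force` the conversion that triggers the JIT and the GPU-to-CPU copy):

```haskell
import qualified Data.Vector.Storable as VS

-- Hypothetical stand-in for a lazy backend array: forcing it is the
-- single point where the backend JITs and runs the accumulated graph.
newtype Lazy = Lazy { force :: VS.Vector Double }

-- Placeholder for the real gradient computation; in the ArrayFire
-- scenario this would only build the computation graph, not run it.
gradient :: VS.Vector Double -> Lazy
gradient p = Lazy (VS.map (* 2) p)

-- One forced conversion per iteration: the JIT fires exactly once,
-- at the state update, as discussed above.
step :: Double -> VS.Vector Double -> VS.Vector Double
step lr p = VS.zipWith (\w g -> w - lr * g) p (force (gradient p))
```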

So the only substantial problem that remains is porting orthotope to an arbitrary underlying tensor representation (#72) instead of assuming a RAM array (Storable Vector). That's because triggering the JIT after every single operation on tensors would be a performance disaster. But the port is not needed for the ArrayFire vs hmatrix benchmarks, which can be done without type-level shape checking.

Definitely open to it. I know hmatrix has a typed API; in practice, teams I've been on didn't use it much, but I'm open to adding one to arrayfire because I do think it can really help in certain circumstances.

I don't have experience with that one, but I don't see https://hackage.haskell.org/package/ghc-typelits-natnormalise used in the typed hmatrix code examples. From my experience with orthotope, without the plugin type-checking any larger code is a nightmare, and the version that finally type-checks is terribly bloated (basically carrying lots of repeated trivial arithmetic constraints like n + 1 ~ 1 + n in its type signatures). Given that GHC support for such plugins is only now becoming mature, perhaps it's the right moment to retry shape-checking.
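
For reference, this is the kind of thing the plugin buys (a minimal sketch, assuming the ghc-typelits-natnormalise package is a dependency):

```haskell
{-# OPTIONS_GHC -fplugin GHC.TypeLits.Normalise #-}
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
import GHC.TypeLits (Nat, type (+))

data Vec (n :: Nat) a where
  Nil  :: Vec 0 a
  Cons :: a -> Vec n a -> Vec (n + 1) a

-- The Cons case needs (1 + n) + 1 ~ 1 + (n + 1): the plugin solves
-- this silently; without it, GHC rejects the natural definition, or
-- the signature has to carry trivial arithmetic constraints.
snoc :: a -> Vec n a -> Vec (1 + n) a
snoc x Nil         = Cons x Nil
snoc x (Cons y ys) = Cons y (snoc x ys)
```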

BTW, is the restriction to 4 dimensions something deep-seated, or is it easy to lift? I never used more dimensions, but it's a mismatch vs the orthotope API and also one more thing to warn users about.

dmjio commented 1 year ago

It seems ArrayFire does just that, though probably not with MLIR (it predates MLIR?)

I believe this to be true.

Given that GHC support for such plugins is only now becoming mature, perhaps it's the right moment to retry shape-checking.

If GHC's type checker would otherwise be too slow, this sounds ideal. Also, it might be best to newtype Array with type-level dimension information and re-export a typed variant module for all current modules. Just a thought.
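
Something like the following, perhaps (a sketch of the suggestion with hypothetical names; `arr` stands for ArrayFire's untyped Array):

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}
import GHC.TypeLits (Nat)

-- Hypothetical typed wrapper: phantom type-level dimensions over an
-- untyped array type (e.g., ArrayFire's Array).
newtype Typed (dims :: [Nat]) arr = Typed { untype :: arr }

-- A typed re-export only adjusts the phantom index and delegates to
-- the untyped operation; no runtime cost beyond the newtype.
typedMatmul
  :: (arr -> arr -> arr)   -- the untyped matmul
  -> Typed '[m, k] arr
  -> Typed '[k, n] arr
  -> Typed '[m, n] arr
typedMatmul f (Typed a) (Typed b) = Typed (f a b)
```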

BTW, is the restriction to 4 dimensions something deep-seated, or is it easy to lift?

It seems limited to 4 dimensions currently.

Seeing benchmarks against hmatrix would be exciting 😎

Mikolaj commented 1 year ago

Note to self regarding ArrayFire (and this may be totally wrong): it seems the API does not contain anything like generate or build, nor map. This probably means we'd need to perform these operations in Haskell-land, which incurs transfers of the input and output data between Haskell-land and ArrayFire-land. Also, we can't count on ArrayFire optimizing such operations (e.g., selectively vectorizing and fusing, depending on the capabilities of the backend). Perhaps vectorizing such code ourselves before running it via ArrayFire might help, and perhaps ArrayFire can fuse back some of it, as needed. The round trip would look something like the sketch below.
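
A sketch of that round trip (copyIn/copyOut are hypothetical stand-ins for the real FFI transfer calls, not an actual ArrayFire API):

```haskell
import qualified Data.Vector.Storable as VS

-- Illustration only: applying a Haskell closure to backend-resident
-- data forces a copy out of ArrayFire-land and a copy back in, and
-- the backend cannot see through (or fuse with) the closure.
mapViaHaskell
  :: (VS.Vector Double -> arr)   -- copy in  (hypothetical FFI wrapper)
  -> (arr -> VS.Vector Double)   -- copy out (hypothetical FFI wrapper)
  -> (Double -> Double)          -- arbitrary Haskell closure
  -> arr
  -> arr
mapViaHaskell copyIn copyOut f = copyIn . VS.map f . copyOut
```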

While googling for that, I noticed that the Haskell CUDA bindings (https://hackage.haskell.org/package/cuda) seem to be even more low-level, so that's probably a common problem (and, frankly, what was I thinking; how would ArrayFire or CUDA understand Haskell closures?). OTOH, Julia has bindings both for CUDA and ArrayFire, and I wonder if they contain build or map (https://discourse.julialang.org/t/arrayfire-vs-abstractarray-performance-and-future-in-julia/60533). I haven't checked MLIR, but it may be similarly restricted; likely generate and map are not available even for MLIR or LLVM intermediate-language functions, not to mention Haskell closures.

dmjio commented 1 year ago

Just a slight note on the array construction within ArrayFire. The .Data module should contain some ways to construct an Array natively on the GPU (bypassing the Haskell heap). These are kind of like predefined ways to get an identity matrix, tiling, etc.

Tangentially, there are ways to save an Array to disk and read it back into the GPU. https://hackage.haskell.org/package/arrayfire-0.7.0.0/docs/ArrayFire-Util.html#v:saveArray

But yeah, for custom Array construction (via Vector or otherwise) you'd most likely have to cross the Haskell boundary (as seen in the .Array module). It should be calling into the af_create_array C function.

Mikolaj commented 1 year ago

It's worth investigating whether https://github.com/twesterhout/halide-haskell permits generation of CUDA or OpenCL code (for the user to, e.g., compile for some exotic embedded device); if so, it may well be a more convenient alternative to MLIR.