FluxML / Flux.jl

Relax! Flux is the ML library that doesn't make you tensor
https://fluxml.ai/

ArrayFire #1126

Open clive-g-brown opened 4 years ago

clive-g-brown commented 4 years ago

Have you thought about an ArrayFire backend (as well as CUDA)? It would make Flux usable on AMD devices (e.g. on Mac).

bhvieira commented 4 years ago

What would need to be done? ArrayFire extends AbstractArray, so it looks like it could work out of the box. Have you tested it? Sadly, I don't own an AMD GPU to test it myself.

clive-g-brown commented 4 years ago

I don't know, it's not my area of expertise, but I'm happy to fork and have a go.

clive-g-brown commented 4 years ago

I tried the examples at https://fluxml.ai/Flux.jl/stable/gpu/#

but using the ArrayFire equivalents - all fine. It's difficult to tell if it's doing anything on the GPU yet, though.

bhvieira commented 4 years ago

Have you seen #938 as well? It's based on ROCArrays, and it gives a good idea of how to implement things and test whether they work.

bhvieira commented 4 years ago

difficult to tell if its doing anything on GPU [yet].

Do some huge, repeated matmuls, like a Dense(500, 500) on a 500x1000 random dataset or something similar, and check GPU utilization; that will give you a good idea of whether it works.
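Something along these lines (an untested sketch; it assumes ArrayFire.jl is installed and has found a device, and that Dense has the W/b fields Flux used at the time):

```julia
using Flux, ArrayFire

m  = Dense(500, 500)                       # plain CPU layer
mg = Dense(AFArray(m.W), AFArray(m.b))     # same weights as AFArrays
x  = AFArray(rand(Float32, 500, 1000))     # 500x1000 random dataset on the device

for _ in 1:1000                            # repeated matmuls to load the GPU
    mg(x)
end
# Watch utilization in another terminal (radeontop, nvidia-smi, Activity Monitor, ...).
```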

clive-g-brown commented 4 years ago

I did something like that. AFArrays seem to work! But |>gpu, I'm not so sure about. ROCArrays doesn't compile (OSX), and other packages also seem defunct.

bhvieira commented 4 years ago

gpu is simply syntactic sugar; if you can define your layers with AFArray params and they work, then it's a good sign.

clive-g-brown commented 4 years ago

Got it working. As you say, define the layers with ArrayFire, but you also have to define the datasets as ArrayFire arrays. Of the activation functions, AF only supports sigmoid. But if I do all that, it works on AMD: 0.00017s for a given dataset versus 0.034s on the CPU (|>gpu doesn't do anything). Some of the speed difference there is undoubtedly the lack of latency on the async GPU calls.
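For illustration, a minimal sketch of that setup (not my exact code; the W/b field names and the sigmoid-only caveat are assumptions from the above):

```julia
using Flux, ArrayFire

x_cpu = rand(Float32, 500, 1000)
m_cpu = Dense(500, 500, sigmoid)

x_af = AFArray(x_cpu)                                       # dataset as an AFArray too
m_af = Dense(AFArray(m_cpu.W), AFArray(m_cpu.b), sigmoid)   # layer defined with AFArrays

@time m_cpu(x_cpu)   # CPU timing
@time m_af(x_af)     # AF timing (the calls are async, so this flatters the GPU)
```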

bhvieira commented 4 years ago

0.0001s for a given dataset versus 0.04 on cpu

Cool!

|>gpu doesn't do anything

Yeah, you'd need to change a bit of code for it to work. gpu sends your arrays to CUDA, basically; that's all it does.
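For context, Flux's gpu at the time essentially did fmap(CuArrays.cu, x) behind a CUDA-availability flag; a hypothetical ArrayFire analogue (the name afgpu is made up here) could be as small as:

```julia
using Flux, ArrayFire
using Flux: fmap

# Walk the model and replace every array leaf with an AFArray, leaving
# everything else (activation functions, integers, ...) untouched.
afgpu(m) = fmap(a -> a isa AbstractArray ? AFArray(a) : a, m)

model = Chain(Dense(500, 500, sigmoid), Dense(500, 10)) |> afgpu
```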

DhairyaLGandhi commented 4 years ago

We could get it to offload to AF, but yeah, it's just sugar on top of CUDA.

bhvieira commented 4 years ago

We could get it to offload to AF, but yeah, just sugar above cuda

I've been thinking of setting up GPU backends for Flux, either baking them into the package (à la Plots, with its many backends) or splitting it up into lightweight libraries (e.g. FluxCuda, FluxArrayFire and FluxROCm). Then it would be up to the user to load the correct one, and the gpu method would then be the correct one as well.

bhvieira commented 4 years ago

By the way @clive-g-brown, I'm not sure how AFArrays compare to CuArrays, but you might need to convert them back to Arrays to save them to disk safely.
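A hedged sketch of that, using BSON (the serializer the Flux docs use for checkpoints) and fmap to move the parameters back to plain Arrays first (this assumes AFArrays convert back with Array(x)):

```julia
using Flux, ArrayFire, BSON
using Flux: fmap

m_af  = Dense(AFArray(rand(Float32, 4, 4)), AFArray(rand(Float32, 4)))  # toy AF model
m_cpu = fmap(a -> a isa AbstractArray ? Array(a) : a, m_af)             # back to Arrays

BSON.@save "model.bson" m_cpu
```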

clive-g-brown commented 4 years ago

Great idea on the multiple back ends. You’d reach Mac users like me.

DhairyaLGandhi commented 4 years ago

I've been thinking of setting up GPU backends for Flux, either baking them into the package (a la Plots, with the many backends) or splitting it up into lightweight libraries (eg FluxCuda, FluxArrayFire and FluxROCm) for example. Then it's up to the user to load the correct one, and then the gpu method would be the correct one as well.

Or just do it through Requires, similar to CUDAdrv.jl? That might be cleaner and wouldn't require maintaining as many parallel glue packages.
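A very rough sketch of what that could look like (not Flux's actual code; the UUID below is CuArrays' registered one, and each extra backend would get its own @require block):

```julia
module GPUBackendSketch

using Requires
using Flux: fmap

gpu(x) = x   # fallback: identity when no GPU package is loaded

function __init__()
    @require CuArrays="3a865a2d-5b23-5a0f-bc46-62713ec82fae" begin
        using .CuArrays
        gpu(x) = fmap(CuArrays.cu, x)
    end
    # An ArrayFire block would follow the same pattern with ArrayFire.jl's UUID,
    # converting array leaves to AFArrays instead.
end

end # module
```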

bhvieira commented 4 years ago

I'll be honest here and say that I had never read into Requires 😋 But looking at the package repository, that seems like it, @dhairyagandhi96. I'm not sure about the intelligent way to make it work if the user has both CuArrays and ArrayFire installed at the same time.

clive-g-brown commented 4 years ago

By the way @clive-g-brown, I'm not sure how AFArrays compare to CuArrays, but you might need to reconvert them to Arrays to save it to disk safely.

It has its own save/load (I do wonder what serialize would do). The wrapping seems more extensive than is documented on the front page - all the bits are there for more activation functions.

https://github.com/JuliaGPU/ArrayFire.jl/blob/master/src/wrap.jl

clive-g-brown commented 4 years ago

I only got so far with this; when I try LSTM layers it breaks on broadcasting.

bhvieira commented 4 years ago

Did your code work on CUDA/CPU before? There are some things in broadcasting that became harder to get right with Zygote; I've seen other users commenting on it.

clive-g-brown commented 4 years ago

CPU is fine. I don't have a CUDA test setup; I'll have to reimplement. No, it's this: error("Use broadcasting (", $(string(f)), ".(x)) to apply activation functions to arrays."). Getting fmap to flip everything to the GPU seems to be the issue; it's still using Flux.sigmoid instead of the GPU one.
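One hypothetical workaround for layers that take the activation as an argument (it won't help for LSTM cells, which hard-code σ and tanh): construct the layer with ArrayFire's wrapped sigmoid from wrap.jl instead of Flux.sigmoid, so the activation stays on the device. This assumes ArrayFire.jl does expose such a wrapper:

```julia
using Flux, ArrayFire

W = AFArray(rand(Float32, 500, 500))
b = AFArray(rand(Float32, 500))

# ArrayFire.sigmoid is assumed to be the device-side wrapper from wrap.jl.
m = Dense(W, b, ArrayFire.sigmoid)
x = AFArray(rand(Float32, 500, 1000))
m(x)
```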

amtapani commented 4 years ago

AF seems to take some liberties with broadcasting. It also trips over many basic NNlib functions because of that. Looking at the source, it appears to convert e.g. relu.(x) back to relu(x) and assume the function can run in vectorized form without explicit broadcast, which leads to that error. I suspect that's what's going on, anyway. It's a bit hard to tell, since it uses a global flag to control the broadcast somehow.

AF also has trouble with Zygote.gradient, since it has a try-catch block inside broadcasted(), which evidently isn't allowed. Then there are some arithmetic oddities that don't quite match Julia: e.g. you can do A*X with nd-arrays and it will matrix-multiply the deepest dimensions, which Julia doesn't do out of the box. That might cause inconsistent behaviour.

I've been doing a bit of feasibility assessment on either improving ArrayFire, reviving CLArrays, or creating a new package from scratch with a somewhat different approach. I'll probably have some extra time over the summer, so I'm giving it a proper go at least. Honestly, I'm leaning towards the last option. AF seems like a good idea at a glance, but I don't think I'd feel comfortable relying on it for anything non-trivial, and CLArrays might be too much for me to untangle. This is just a tentative heads-up in case someone has similar plans. Maybe we can compare notes.

clive-g-brown commented 4 years ago

Thanks for that, useful to know. I hadn't grokked it completely. The issue is using an AMD GPU, which is compulsory on Mac, so a MetalArrays would sort that, but it wouldn't work so well on Windows/Linux. PlaidML has a nice system: you choose a backend, which includes Metal and OpenCL. I don't know if their Tile system is wrappable or compatible, although it clearly works with Python/Keras/TF.

AriMKatz commented 4 years ago

@jpsamaroo is doing stuff with AMD arrays. There's also been talk of Metal GPU codegen by @PhilipVinc on the #GPU Slack channel.

amtapani commented 4 years ago

That still leaves out at least Intel and ARM hardware on most platforms, and all non-NVIDIA hardware on Windows (until ROCm support is expanded a bit). I think OpenCL still has some mileage in it, at least until the remaining parties unveil their tailored APIs for select OSes. We'll have a dozen backends to support in a few years.

I actually briefly looked into hacking TensorFlow.jl to use PlaidML via nGraph. I believe that would work, but it might take more effort than making a full backend from scratch. It's also not a very Julian way of doing things.

clive-g-brown commented 4 years ago

That's intriguing; I'll have a look at that over the weekend.

jpsamaroo commented 3 years ago

I don't think ArrayFire is a good way forward for a language like Julia. As we can see from the above conversation, ArrayFire.jl has to do a variety of "unholy" things to be able to dispatch to the ArrayFire library, and is limited by what ArrayFire's underlying library is built to do. What I see as the best way forward is the following:

J1MC83N commented 3 years ago

I just started learning Flux and have been trying to get ArrayFire to work. AFArray doesn't seem to support in-place operations and can't be modified without changing its objectid, but Params works on IdDicts. Is there a workaround for this?
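For reference, here is a small CPU-only illustration of the identity problem being described (not a fix): Params tracks parameters by object identity, so an out-of-place update produces a new object that the parameter set no longer recognises.

```julia
using Flux

W  = rand(Float32, 3, 3)
ps = Flux.params(W)

any(p === W for p in ps)    # true: W is tracked by identity
W2 = W .- 0.1f0             # out-of-place "update" allocates a new array
any(p === W2 for p in ps)   # false: new objectid, so no longer tracked
```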