Closed cems2 closed 4 years ago
What version are you using? A bug with this was fixed in the patch release 0.10.2.
I edited the original post to include all packages added and their versions. I installed this a few days ago via the default Pkg.add process. It did not pull 0.10.2; it pulled 0.10.0. I will see if I can get it to work with 0.10.2 now.
Update: I just installed 0.10.2 and observe 2 things.
I no longer get the scalar index warning or the corresponding error.
However... GPU is still slower than CPU.
I've scaled my problem to see if it's GPU kernel-launch overhead, but as I scale it the ratio stays between 9 and 10 in speed (CPU faster).
Supporting this observation: GPU utilization shows only about 16%, compared to about 80% for a non-ODE problem of similar shape.
Interesting. What model are you running?
The one in the code I pasted above, so it's just a Dense 2x2 multiply-and-add for the neural part. I am scaling that by changing the number of points (length = 200):
t = range(tspan...,length=8) # can make the data set larger
The above code contains a comment block to switch between CPU and GPU by assigning the variable cg.
By the way, one thing I have not figured out how to do is how to lift the return from Dense to the GPU directly using the cg() method I'm using. Instead I have to use the blunter tool of |> gpu.
cg works by switching between Array (CPU) and Flux's cu (GPU).
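A minimal sketch of that toggle pattern (the names `use_gpu` and `model` are my additions, not from the code above; `Flux.gpu` falls back to a no-op when no CUDA device is available):

```julia
using Flux

use_gpu = false                      # flip to true on a CUDA-capable machine
cg = use_gpu ? Flux.gpu : Flux.cpu   # cg moves arrays/layers to the chosen device

model = Dense(2, 2) |> cg            # same model definition works on either device
x = cg(rand(Float32, 2, 8))          # data moved with the same toggle
y = model(x)                         # runs on CPU or GPU depending on the flag
```

The point of routing everything through one variable is that the rest of the script never mentions the device again.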
I'm hoping I'm doing something dumb that makes the GPU slow. But in case I'm not, I'm wondering if there is a way to tell Flux or DiffEq to use all the cores on the CPU processor?
Dense 2x2 matrix multiplies are too small to benefit from the GPU. I think you need at least 100x100 matmuls to see a speedup. Just try CuArrays directly and see where it's faster.
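A quick way to try that suggestion: time matmuls of increasing size on the CPU, and compare against the GPU versions (commented out here; they assume CuArrays is installed, and `CuArrays.@sync` is needed so the timing waits for the kernel to finish):

```julia
using LinearAlgebra

# Time CPU matmuls across sizes; at n = 2 a GPU call is dominated by
# kernel-launch overhead, while at n = 1000 CUBLAS should win easily.
results = Dict{Int,Float64}()
for n in (2, 100, 1000)
    A = rand(Float32, n, n)
    B = rand(Float32, n, n)
    A * B                              # warm-up call (compilation)
    results[n] = @elapsed A * B
    # Ag, Bg = cu(A), cu(B)                      # GPU version (needs CuArrays)
    # t_gpu = @elapsed CuArrays.@sync Ag * Bg
end
results
```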
Also, adaptive ODE solving has a mapreduce in the norm calculation, so that's less optimized right now than the matmul, but we are going to update the CuArray mapreduce sometime in January to improve it. But that shouldn't be a "major" issue: it just pushes up the neural network size that is required to get acceleration.
But in case I'm not I'm wondering if there is a way to tell Flux or DiffEq to use all the cores on the CPU processor?
BLAS is already multithreaded, so matmuls should use multiple cores without changing anything. Just do a big matmul and check your resource utilization.
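One way to check this, as a sketch: pin the BLAS thread count explicitly and watch core utilization during a large multiply. (`BLAS.set_num_threads` exists in Julia 1.2; `BLAS.get_num_threads` was only added in later Julia versions.)

```julia
using LinearAlgebra

# BLAS matmuls are multithreaded out of the box; set the thread count
# explicitly to all logical cores, then run a matmul big enough to see
# the load in a system monitor (htop, Task Manager, etc.).
BLAS.set_num_threads(Sys.CPU_THREADS)
A = rand(2000, 2000)
B = rand(2000, 2000)
@time C = A * B
```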
Oh. Is Dense() optimized to call BLAS? That is, if I write my own Dense in plain Julia, I don't see offhand how Julia would "know" to coalesce the multiply step with the add step the way BLAS can (and thus exploit the Intel AVX vectors on the CPU). (I am doing things like writing a ResNet-style Dense directly in Julia.)
Julia matrix multiplies just lower to BLAS. So
A = rand(1000,1000)
B = rand(1000,1000)
A*B
will use BLAS, and cu on both will use CUBLAS.
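Concretely, the GPU version of those three lines looks like the following sketch (it assumes a machine with a working CUDA setup and the CuArrays package; note that `cu` also converts the arrays to Float32):

```julia
using CuArrays  # provides cu() and the CuArray type

A = rand(1000, 1000)
B = rand(1000, 1000)

Ag = cu(A)       # copy to GPU memory (as Float32)
Bg = cu(B)
Cg = Ag * Bg     # dispatches to CUBLAS instead of CPU BLAS
```

No code in the multiply itself changes; the array type alone selects the BLAS backend via dispatch.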
UPDATE See comments below for the resolution of this problem
It appears Flux training of the DiffEqFlux package's neural_ode method is incompatible with (fast) GPU operations. But I'm hoping that perhaps I'm doing something wrong and you can point me to example code where GPU ops are fast when using neural_ode.
But in my simple tests the GPU is 10x slower than the CPU on an RTX 2070 and an Intel i7:
CPU timing: 1.287959 seconds (8.92 M allocations: 358.429 MiB, 7.67% gc time)
GPU timing: 12.238495 seconds (52.03 M allocations: 1.937 GiB, 5.78% gc time)
GPU utilization is lower when using neural_ode than with a classic neural net. Monitoring nvidia-smi, I see ~3 to 15% activity for the julia process! But if I run a simple ANN instead of a neural_ode, I get GPU utilization above 80%.
While this suggests the problem is in neural_ode and not Flux, I note the following counter-evidence: the warning listed below does not occur when simply running a neural_ode. It only happens when a neural_ode is being used by Flux.
Possibly related bug reports
#65 neural_ode solve leaves GPU
#21 Trouble running the neural ODE example on GPU
#11
and this discourse https://discourse.julialang.org/t/code-using-flux-slow-on-gpu/30696
Guesses about the likely cause: low utilization is frequently caused either by arrays transiting repeatedly between GPU and CPU memory, or by non-parallel operations (often a result of indexed loops rather than vector ops). Support for the former comes from the timing data above: the GPU run uses vastly more memory and allocations, suggesting perhaps that arrays are being mirrored between GPU and CPU unnecessarily.
Support for the latter hypothesis (indexed looping) comes from the error/warning: using neural_ode with Flux.train!() gives me a warning that scalar ops are very slow, but the code runs.
If I assert CuArrays.allowscalar(false), then instead of a warning I get an error raised in Flux.train! that relates to setting an index.
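For context, here is a minimal sketch of that mechanism in isolation (it assumes a working CuArrays install; the array `x` is my own example, not from the code below):

```julia
using CuArrays

# Turn the "scalar operations are slow" warning into a hard error, so the
# offending indexed operation appears in the stacktrace instead of silently
# pulling single elements back to the CPU.
CuArrays.allowscalar(false)

x = cu(rand(Float32, 4))
x .+ 1f0   # fine: broadcast runs as a single GPU kernel
x[1]       # errors: scalar getindex is now disallowed
```

This is why the traceback below points at the indexing code rather than at the place the loop lives.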
Since the traceback shows the warning in indexing.jl passes through Tracker early on, I have tried looking at the code there to see if I can figure out why, but I can't understand it. I'm going to paste some code to reproduce this below, followed by the error traceback.
Installed Package versions in Julia 1.2
Code to Reproduce problem
Here is the error when CuArrays.allowscalar(false) is asserted in GPU mode (cg == cu):