Closed CorySimon closed 8 years ago
I don't see any big problem in the generated code. It is faster with @acc
on my machine:
Without @acc
0.018236 seconds (1.40 k allocations: 15.301 MB, 24.15% gc time)
With @acc
0.002088 seconds (5.40 k allocations: 189.938 KB)
We need accurate performance analysis, but I think this problem size is too small to justify the parallelism overheads (e.g. cache communication).
One option is to include more computation and/or data initialization in the @acc function, which amortizes the parallelism overheads. You could also turn off parallelism and still benefit from the other optimizations with the CGEN_NO_OMP=1 environment variable.
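As a concrete sketch, disabling OpenMP for a run might look like this (the script name is a placeholder, not from the thread):

```shell
# Hypothetical session: turn off ParallelAccelerator's OpenMP parallelism
# so that only the other CGen optimizations apply.
export CGEN_NO_OMP=1
echo "CGEN_NO_OMP=$CGEN_NO_OMP"
# julia example.jl   # example.jl stands in for your own script
```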
Hi @CorySimon, thanks for trying it out. Can you say something about the machine you're running this on, such as number of cores, etc? Also: which Julia version do you have? Which external C++ compiler are you using? Are you using MKL? Are you using OpenMP? All these things could play a role in performance.
On my desktop Ubuntu machine (4 physical cores) with Julia 0.4.4-pre, using ICC and MKL, and OpenMP, I seem to be getting:
julia> include("/home/lkuper/example.jl")
Without @acc
0.013547 seconds (1.40 k allocations: 15.301 MB, 20.32% gc time)
With @acc
0.002870 seconds (5.40 k allocations: 189.938 KB)
with the code as written.
Just to chime in, I see 0.024s and 0.018s for "without acc" and "with acc" respectively on the machine I tried. Also, note that you don't need two versions of the function. You only need one and when you want to run without the optimizations you can use "@noacc f2(...)". For 100x the problem size, I see 1.7s versus 0.15s.
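A minimal sketch of that single-definition pattern (the function name and body here are made up for illustration, not taken from the thread):

```julia
using ParallelAccelerator

# One accelerated definition; no need for a second, unaccelerated copy.
@acc function f2(x, y)
    sum(x .* y)
end

f2(rand(10^6), rand(10^6))          # optimized version
@noacc f2(rand(10^6), rand(10^6))   # same function, optimizations bypassed
```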
That is impressive speed-up! I am now optimistic that this will help my code (and research!).
@lkuper How do I control which C++ compiler ParallelAccelerator uses, or whether I am using OpenMP? I export OMP_NUM_THREADS=8. I did not install MKL or ICC.
My specs are:
Julia 0.4.2
Ubuntu 14.04.4 LTS, x86_64-linux-gnu, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
cat /proc/cpuinfo | grep processor | wc -l
yields 8
gcc --version
yields gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
@DrTodd13 Thanks for the tip.
EDIT: I now see that export CGEN_NO_OMP=1 in the terminal will turn off OpenMP. And, for the result in my first post, I was using OpenMP, since this variable was not in my environment.
@CorySimon When you run ParallelAccelerator.build(), ParallelAccelerator will look for ICC first and, if it can't find it, will fall back to GCC. In your case it fell back to using GCC. It will also look for MKL and try to use it if it's present. The build script that controls this is build.sh.
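A rough sketch of that probing logic (the actual build.sh details may differ; the compiler names are the usual ICC/GCC binaries):

```shell
# Emulate the compiler search: prefer Intel's icpc, fall back to g++.
if command -v icpc >/dev/null 2>&1; then
  echo "using icpc"
else
  echo "falling back to g++"
fi
```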
Your machine looks very much like mine (your 8 logical cores are probably 4 physical cores with hyperthreading). Out of curiosity, I just manually disabled ICC and MKL on my machine (by commenting out parts of build.sh -- ideally, one would use env vars to do this) and tried with Julia 0.4.2 and GCC, and I got a slowdown with @acc (I have GCC 4.8.5, but I'm not sure that matters much):
julia> include("/home/lkuper/example.jl")
Without @acc
0.014792 seconds (1.40 k allocations: 15.301 MB, 29.65% gc time)
With @acc
0.121271 seconds (5.40 k allocations: 189.938 KB)
However! On Julia 0.4.4-pre, even with GCC and no MKL, same as above, I get:
julia> include("/home/lkuper/example.jl")
Without @acc
0.011174 seconds (1.40 k allocations: 15.301 MB)
With @acc
0.002982 seconds (5.40 k allocations: 189.938 KB)
So one thing to try next is to upgrade your Julia and see what happens. :)
Most likely you're using OpenMP unless you did something to turn it off explicitly. You'll see a message "OpenMP is not used." from ParallelAccelerator at runtime if it is disabled.
@lkuper Thank you! I updated Julia, and I see modest improvements with @acc
now.
Since /opt/intel/ had files in it already, I installed to /opt/intel2/ instead. Now, when I Pkg.build("ParallelAccelerator"), it says that it is still using g++ and it does not detect MKL. I ran /opt/intel2/mkl/bin/mklvars.sh, but it did not help it find MKL. Is there some environment variable I need to set in my ~/.bashrc for it to find MKL and the Intel compiler?
To @acc my code, I replaced the commented lines in the highlighted code with the @acc macro-ed lines here. I got an error, OptFramework failed to optimize function .* in optimization pass ParallelAccelerator.Driver.toCGen with error AssertionError("CGen: variable #2022#v cannot have Any (unresolved) type"), suggesting type instability.
However, I declared the type of k_dot_dx and declared the type of the charges attribute of framework in this file here, so all types should be known, right?
Thank you!
@CorySimon For MKL, you should make sure that /opt/intel2/mkl is in your LD_LIBRARY_PATH. I'd suggest also setting MKLROOT to /opt/intel2/mkl (although mklvars.sh should already do this).
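For example, the ~/.bashrc additions might look like this (the exact directories are assumptions about the MKL layout, not from the thread):

```shell
# Point ParallelAccelerator's build at a non-default MKL install.
export MKLROOT=/opt/intel2/mkl
export LD_LIBRARY_PATH="/opt/intel2/mkl:${LD_LIBRARY_PATH:-}"
echo "$MKLROOT"
```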
@CorySimon I looked at the code you linked and saw that you were using @acc at the call site. It is recommended to always use @acc on function declarations instead; using it at a call site is not mentioned in our User Guide and most likely will not work the way you intended.
The usual practice is to extract the piece you want to accelerate into its own function, and put @acc in front of its definition. Hope it helps.
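In other words, a sketch of that practice (with a made-up function, not code from the thread):

```julia
using ParallelAccelerator

# Right: extract the hot piece into its own function
# and annotate the definition...
@acc function kernel(x, y)
    sum(x .* y)
end

kernel(rand(1000), rand(1000))

# ...rather than annotating a call site, i.e. avoid: @acc sum(x .* y)
```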
@CorySimon Are you still having problems, or is it OK to close this issue?
@ninegua, @lkuper Yes, thank you!
Lesson: use the latest version of Julia for @acc to speed up your code, and put @acc in front of function definitions, not call sites.
👍
First, really great work!
I tried using @acc in my Julia code, but it caused my code to slow down by 50 times. Here is a toy example that resembles what I am doing in my code:
The result is:
Why is this so much slower with @acc? Oddly, there are more allocations with @acc but less memory used in these allocations. When I increase the size of the arrays x and y, @acc becomes faster. This must just be the overhead from the parallelization? I imagine the sum incurs a cost for parallelization, but the first two lines in the function should not. So there is only a payoff with @acc when the size of the array is very large?