IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs

50x slow-down with @acc #86

Closed CorySimon closed 8 years ago

CorySimon commented 8 years ago

First, really great work!

I tried using @acc in my Julia code, but it slowed my code down by a factor of 50.

Here is a toy example that resembles what I am doing in my code:


using ParallelAccelerator

function f1(x::Array{Float64}, y::Array{Float64})
   x = cos(x)
   x = x .* y
   energy = sum(x)
   return energy
end

@acc function f2(x::Array{Float64}, y::Array{Float64})
   x = cos(x)
   x = x .* y
   energy = sum(x)
   return energy
end

x = rand(10000)
y = rand(10000)

# warmup
f1(x, y)
f2(x, y)
println("Without @acc")
@time [f1(x, y) for i = 1:100]
println("With @acc")
@time [f2(x, y) for i = 1:100]

The result is:

Without @acc
  0.010626 seconds (1.40 k allocations: 15.301 MB)
With @acc
  0.330101 seconds (5.40 k allocations: 189.938 KB)

Why is this so much slower with @acc? Oddly, the @acc version makes more allocations but allocates far less memory overall. When I increase the size of the arrays x and y, @acc becomes faster, so presumably this is just parallelization overhead? I imagine the sum incurs a parallelization cost, but the first two lines of the function should not. So is there only a payoff with @acc when the arrays are very large?

ehsantn commented 8 years ago

I don't see any big problem in the generated code. It is faster with @acc on my machine:

Without @acc
  0.018236 seconds (1.40 k allocations: 15.301 MB, 24.15% gc time)
With @acc
  0.002088 seconds (5.40 k allocations: 189.938 KB)

We'd need a more careful performance analysis, but I think this problem size is too small to justify the parallelism overheads (e.g., cache communication).

One option is to include more computation and/or data initialization in the @acc function, which helps amortize the parallelism overheads. You could also turn off parallelism and still benefit from the other optimizations by setting the CGEN_NO_OMP=1 environment variable.
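For reference, here is a minimal sketch of the first suggestion. The function name, the loop count, and the idea of moving the repetition loop inside the accelerated function are illustrative assumptions, not code from this thread; the point is just that each call to the @acc'ed function does enough work to amortize its per-call overheads.

using ParallelAccelerator

# Hypothetical variant of f2: the repetition loop runs inside the @acc'ed
# function, so the per-call overheads (crossing into the generated code,
# thread start-up, synchronization) are spread over 'reps' kernel evaluations.
@acc function f2_repeated(x::Array{Float64}, y::Array{Float64}, reps::Int)
    energy = 0.0
    for i = 1:reps
        energy += sum(cos(x) .* y)
    end
    return energy
end

x = rand(10000)
y = rand(10000)

f2_repeated(x, y, 1)          # warmup (triggers compilation)
@time f2_repeated(x, y, 100)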

lkuper commented 8 years ago

Hi @CorySimon, thanks for trying it out. Can you say something about the machine you're running this on, such as number of cores, etc? Also: which Julia version do you have? Which external C++ compiler are you using? Are you using MKL? Are you using OpenMP? All these things could play a role in performance.

On my desktop Ubuntu machine (4 physical cores) with Julia 0.4.4-pre, using ICC, MKL, and OpenMP, I seem to be getting:

julia> include("/home/lkuper/example.jl")
Without @acc
  0.013547 seconds (1.40 k allocations: 15.301 MB, 20.32% gc time)
With @acc
  0.002870 seconds (5.40 k allocations: 189.938 KB)

with the code as written.

DrTodd13 commented 8 years ago

Just to chime in: on the machine I tried, I see 0.024s without @acc and 0.018s with @acc. Also, note that you don't need two versions of the function. You only need one, and when you want to run it without the optimizations you can call it as "@noacc f2(...)". For 100x the problem size, I see 1.7s versus 0.15s.
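For reference, a minimal sketch of that pattern, reusing the example from the original post (the only assumption is that @noacc wraps an individual call, as described above):

using ParallelAccelerator

# Single definition, compiled through ParallelAccelerator.
@acc function f2(x::Array{Float64}, y::Array{Float64})
   x = cos(x)
   x = x .* y
   return sum(x)
end

x = rand(10000)
y = rand(10000)

# warmup for both the accelerated and the plain Julia paths
f2(x, y)
@noacc f2(x, y)

println("Without @acc")
@time [(@noacc f2(x, y)) for i = 1:100]
println("With @acc")
@time [f2(x, y) for i = 1:100]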

CorySimon commented 8 years ago

That is impressive speed-up! I am now optimistic that this will help my code (and research!).

@lkuper How do I control which C++ compiler ParallelAccelerator uses, or whether it uses OpenMP? I export OMP_NUM_THREADS=8. I did not install MKL or ICC.

My specs are:

- Julia 0.4.2
- Ubuntu 14.04.4 LTS, x86_64-linux-gnu
- Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
- cat /proc/cpuinfo | grep processor | wc -l yields 8
- gcc --version yields gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

@DrTodd13 Thanks for the tip.

EDIT: I now see that running export CGEN_NO_OMP=1 in the terminal will turn off OpenMP. For the result in my first post, I was using OpenMP, since this variable was not set in my environment.

lkuper commented 8 years ago

@CorySimon When you run ParallelAccelerator.build(), ParallelAccelerator will look for ICC first and if it can't find it, will look for GCC. In your case it fell back to using GCC. It will also look for MKL and try to use it if it's present. The build script that controls this is build.sh.
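For reference, re-running the build after the toolchain changes is just the following (a sketch, assuming build.sh re-probes for ICC and MKL each time it runs):

# re-run the build script so it probes again for ICC and MKL
Pkg.build("ParallelAccelerator")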

Your machine looks very much like mine (your 8 cores are probably 4 physical cores). Out of curiosity, I just manually disabled ICC and MKL on my machine (by commenting out parts of build.sh; ideally, one would use env vars to do this) and tried with Julia 0.4.2 and GCC, and I got a slowdown with @acc (I have GCC 4.8.5, but I'm not sure that matters much):

julia> include("/home/lkuper/example.jl")
Without @acc
  0.014792 seconds (1.40 k allocations: 15.301 MB, 29.65% gc time)
With @acc
  0.121271 seconds (5.40 k allocations: 189.938 KB)

However! On Julia 0.4.4-pre, even with GCC and no MKL, same as above, I get:

julia> include("/home/lkuper/example.jl")
Without @acc
  0.011174 seconds (1.40 k allocations: 15.301 MB)
With @acc
  0.002982 seconds (5.40 k allocations: 189.938 KB)

So one thing to try next is to upgrade your Julia and see what happens. :)

Most likely you're using OpenMP unless you did something to turn it off explicitly. You'll see a message "OpenMP is not used." from ParallelAccelerator at runtime if it is disabled.

CorySimon commented 8 years ago

@lkuper Thank you! I updated Julia, and I see modest improvements with @acc now.

  1. I installed the Intel C++ compiler and MKL. As /opt/intel/ already had files in it, I installed in /opt/intel2/ instead. Now, when I run Pkg.build("ParallelAccelerator"), it says that it is still using g++, and it does not detect MKL. I ran /opt/intel2/mkl/bin/mklvars.sh, but it did not help it find MKL. Is there some environment variable I need to set in my ~/.bashrc for it to find MKL and the Intel compiler?
  2. To @acc my code, I replaced the commented lines in the highlighted code with the @acc macro-ed lines here. I got an error OptFramework failed to optimize function .* in optimization pass ParallelAccelerator.Driver.toCGen with error AssertionError("CGen: variable #2022#v cannot have Any (unresolved) type"), suggesting type instability. However, I declared the type of k_dot_dx and declared the type of the charges attribute of framework in this file here, so all types should be known, right?

Thank you!

lkuper commented 8 years ago

@CorySimon For MKL, you should make sure that /opt/intel2/mkl is in your LD_LIBRARY_PATH. I'd suggest also setting MKLROOT to /opt/intel2/mkl (although mklvars.sh should already do this).
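For example, something like this in ~/.bashrc (a sketch; the lib/intel64 subdirectory and the intel64 argument to mklvars.sh are assumptions for a typical 64-bit install, so adjust to your layout):

# make MKL visible to ParallelAccelerator's build and to the generated code
export MKLROOT=/opt/intel2/mkl
export LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH
# or let Intel's script set these (and related) variables:
source /opt/intel2/mkl/bin/mklvars.sh intel64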

ninegua commented 8 years ago

@CorySimon I looked at the code you linked and saw that you were using @acc at the call site. We recommend always using @acc on function declarations instead; using it at a call site is not mentioned in our User Guide and most likely will not work the way you intended.

The usual practice is to extract the piece you want to accelerate into its own function and put @acc in front of its definition. Hope it helps.
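For reference, a minimal sketch of that pattern; the function name is hypothetical and the body just echoes the k_dot_dx / charges variables mentioned above, not the actual code:

using ParallelAccelerator

# Put @acc on the definition of the extracted function...
@acc function kspace_term(k_dot_dx::Array{Float64}, charges::Array{Float64})
    return sum(cos(k_dot_dx) .* charges)
end

# ...and call it normally, with no @acc at the call site:
energy = kspace_term(rand(1000), rand(1000))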

lkuper commented 8 years ago

@CorySimon Are you still having problems, or is it OK to close this issue?

CorySimon commented 8 years ago

@ninegua, @lkuper Yes, thank you! Lesson: use the latest version of Julia for @acc to speed up your code, and put @acc in front of functions, not call sites.

lkuper commented 8 years ago

👍