JuliaArrays / StaticArrays.jl

Statically sized arrays for Julia

Haskell accelerate is intriguingly faster in some cases? #258

Open c42f opened 7 years ago

c42f commented 7 years ago

As mentioned in passing on the JuliaLang Slack, the blog post below by @idontgetoutmuch is interesting and carefully considered. It is mainly about Haskell, but ends with an addendum comparing the speed of the resulting code to a Julia implementation using StaticArrays:

https://idontgetoutmuch.wordpress.com/2017/06/02/1090/

It's a lot of LLVM assembly, but it's intriguing that there may still be a 2x speedup or so that we're missing out on, even for the following simple code (my interpretation of @idontgetoutmuch's implementation, rewritten in Julia idiom, possibly with added bugs):

using StaticArrays
using LinearAlgebra  # for norm (exported from Base on Julia 0.6, from the stdlib on 0.7+)

# One Störmer-Verlet (leapfrog) step: half-kick in p, full drift in q, half-kick in p.
@inline function oneStep(h, q, p)
    @inline nablaQQ(q) = q/norm(q)^3  # gradient of the Kepler potential
    @inline nablaPP(p) = p            # gradient of the kinetic term
    p2   = p - h/2 * nablaQQ(q)
    qNew = q + h * nablaPP(p2)
    pNew = p2 - h/2 * nablaQQ(qNew)
    return qNew, pNew
end

function manyStepsFinal(f, n, h, q, p)
    for i in 1:n
        q, p = f(h, q, p)
    end
    return q, p
end

# Initial conditions for a Kepler orbit with eccentricity e, starting at perihelion.
e = 0.6
q10 = 1 - e
q20 = 0.0
p10 = 0.0
p20 = sqrt((1 + e) / (1 - e))

h = 0.01

q0 = SVector{2,Float64}(q10, q20)
p0 = SVector{2,Float64}(p10, p20)

manyStepsFinal(oneStep, 1, h, q0, p0) # jit warm up
@time qf, pf = manyStepsFinal(oneStep, 10_000_000, h, q0, p0)

I tried a lot of tweaks to get the 2x speedup, including running with julia -O3 --math-mode=fast, and writing the oneStep() function in vectorized vs non-vectorized LLVM assembly by hand. None of these made things appreciably faster than just using the version above with julia-0.6. I'd probably need to actually benchmark the Haskell version locally to make any more progress...
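
As an aside, for anyone reproducing the timings: BenchmarkTools gives much more stable numbers than @time. A minimal sketch, assuming BenchmarkTools.jl is installed:

using BenchmarkTools

# @btime runs many samples after compilation; the $-interpolation keeps
# access to the global bindings h, q0, p0 out of the measurement.
@btime manyStepsFinal(oneStep, 10_000_000, $h, $q0, $p0)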

idontgetoutmuch commented 7 years ago

I thought the most likely explanation was the one outlined on my blog:

Now it is entirely possible that this results in usage of different libms: Julia calling openlibm and GHC calling the system libm, which on my machine is the one that comes with Mac OS X and is apparently quite a lot faster. We could try compiling the actual LLVM and replacing the Julia calls with pow, but maybe that is the subject for another blog post.

http://info.prelert.com/blog/os-x-maths-functions-are-the-fastest
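
One quick way to test the libm hypothesis directly (a minimal sketch; it assumes both libopenlibm and the system libm resolve by those names via dlopen on the machine in question):

using BenchmarkTools

# pow from openlibm, which Julia 0.5/0.6 bundles...
openlibm_pow(x, y) = ccall((:pow, "libopenlibm"), Float64, (Float64, Float64), x, y)
# ...versus pow from the system libm (part of libSystem on Mac OS X).
system_pow(x, y) = ccall((:pow, "libm"), Float64, (Float64, Float64), x, y)

x, y = 1.000001, 1.5
@btime openlibm_pow($x, $y)
@btime system_pow($x, $y)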

c42f commented 7 years ago

With the Julia implementation above, there are no calls to libm from within oneStep, since I expressed the power as norm(q)^3 (as in the comment on your blog post). Here's the LLVM IR:

julia> @code_llvm oneStep(0.1, SVector(0.1,0.1), SVector(0.2,0.2))

define void @julia_oneStep_61006([2 x %SArray]* noalias nocapture sret, double, %SArray* nocapture readonly dereferenceable(16), %SArray* nocapture readonly dereferenceable(16)) #0 !dbg !5 {
pass11:
  %4 = getelementptr inbounds %SArray, %SArray* %2, i64 0, i32 0, i64 0
  %5 = getelementptr inbounds %SArray, %SArray* %2, i64 0, i32 0, i64 1
  %6 = load double, double* %4, align 8
  %7 = load double, double* %5, align 8
  %8 = fmul double %6, %6
  %9 = fmul double %7, %7
  %10 = fadd double %8, %9
  %11 = call double @llvm.sqrt.f64(double %10)
  %12 = fmul double %11, %11
  %13 = fmul double %11, %12
  %14 = fdiv double %6, %13
  %15 = fdiv double %7, %13
  %16 = fmul double %1, 5.000000e-01
  %17 = fmul double %16, %14
  %18 = fmul double %16, %15
  %19 = getelementptr inbounds %SArray, %SArray* %3, i64 0, i32 0, i64 0
  %20 = load double, double* %19, align 8
  %21 = fsub double %20, %17
  %22 = getelementptr inbounds %SArray, %SArray* %3, i64 0, i32 0, i64 1
  %23 = load double, double* %22, align 8
  %24 = fsub double %23, %18
  %25 = fmul double %21, %1
  %26 = fmul double %24, %1
  %27 = fadd double %6, %25
  %28 = fadd double %7, %26
  %29 = fmul double %27, %27
  %30 = fmul double %28, %28
  %31 = fadd double %29, %30
  %32 = call double @llvm.sqrt.f64(double %31)
  %33 = fmul double %32, %32
  %34 = fmul double %32, %33
  %35 = fdiv double %27, %34
  %36 = fdiv double %28, %34
  %37 = fmul double %16, %35
  %38 = fmul double %16, %36
  %39 = fsub double %21, %37
  %40 = fsub double %24, %38
  %41 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 0, i32 0, i64 0
  store double %27, double* %41, align 8
  %.sroa.235.0..sroa_idx36 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 0, i32 0, i64 1
  store double %28, double* %.sroa.235.0..sroa_idx36, align 8
  %.sroa.3.0..sroa_idx37 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 0
  store double %39, double* %.sroa.3.0..sroa_idx37, align 8
  %.sroa.4.0..sroa_idx38 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 1
  store double %40, double* %.sroa.4.0..sroa_idx38, align 8
  ret void
}

The @llvm.sqrt.f64 intrinsic gets compiled down to a sqrtsd instruction on my local machine. In the comments section of your post it was reported that this helped in both cases, as it should, but the 2x difference remained, which seemed confusing and worth thinking about :-)
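
To make the contrast concrete, a minimal sketch: a literal integer exponent is lowered to plain multiplications, while a floating-point exponent dispatches to the Float64 ^ method (a libm pow call on 0.6):

using StaticArrays
using LinearAlgebra  # for norm on Julia 0.7+

cube_int(q)   = q / norm(q)^3    # literal integer power: expands to multiplies, no libm
cube_float(q) = q / norm(q)^3.0  # Float64 power: goes through ^(::Float64, ::Float64)

@code_llvm cube_int(SVector(0.1, 0.2))    # only fmul/fadd plus the sqrt intrinsic
@code_llvm cube_float(SVector(0.1, 0.2))  # the power is an opaque ^ routine, not inline multiplies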

andyferris commented 7 years ago

Yes, this is extremely interesting. I did read the blog post and was very curious. (Sorry I don't have anything intelligent to say at the moment! But thanks, Chris, for taking a stab.)

tkoolen commented 7 years ago

SIMD? With -O3 --check-bounds=no (the latter of which doesn't seem to matter) I get

define void @julia_oneStep_61312([2 x %SArray]* noalias nocapture sret, double, %SArray* nocapture readonly dereferenceable(16), %SArray* nocapture readonly dereferenceable(16)) #0 !dbg !5 {
pass15:
  %4 = bitcast %SArray* %2 to <2 x double>*
  %5 = load <2 x double>, <2 x double>* %4, align 8
  %6 = extractelement <2 x double> %5, i32 0
  %7 = fmul double %6, %6
  %8 = extractelement <2 x double> %5, i32 1
  %9 = fmul double %8, %8
  %10 = fadd double %7, %9
  %11 = call double @llvm.sqrt.f64(double %10)
  %12 = fmul double %11, %11
  %13 = fmul double %11, %12
  %14 = insertelement <2 x double> undef, double %13, i32 0
  %15 = insertelement <2 x double> %14, double %13, i32 1
  %16 = fdiv <2 x double> %5, %15
  %17 = fmul double %1, 5.000000e-01
  %18 = insertelement <2 x double> undef, double %17, i32 0
  %19 = insertelement <2 x double> %18, double %17, i32 1
  %20 = fmul <2 x double> %19, %16
  %21 = bitcast %SArray* %3 to <2 x double>*
  %22 = load <2 x double>, <2 x double>* %21, align 8
  %23 = fsub <2 x double> %22, %20
  %24 = insertelement <2 x double> undef, double %1, i32 0
  %25 = insertelement <2 x double> %24, double %1, i32 1
  %26 = fmul <2 x double> %25, %23
  %27 = fadd <2 x double> %5, %26
  %28 = extractelement <2 x double> %27, i32 0
  %29 = fmul double %28, %28
  %30 = extractelement <2 x double> %27, i32 1
  %31 = fmul double %30, %30
  %32 = fadd double %29, %31
  %33 = call double @llvm.sqrt.f64(double %32)
  %34 = fmul double %33, %33
  %35 = fmul double %33, %34
  %36 = insertelement <2 x double> undef, double %35, i32 0
  %37 = insertelement <2 x double> %36, double %35, i32 1
  %38 = fdiv <2 x double> %27, %37
  %39 = fmul <2 x double> %19, %38
  %40 = fsub <2 x double> %23, %39
  %41 = bitcast [2 x %SArray]* %0 to <2 x double>*
  store <2 x double> %27, <2 x double>* %41, align 8
  %.sroa.3.0..sroa_idx41 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 0
  %42 = bitcast double* %.sroa.3.0..sroa_idx41 to <2 x double>*
  store <2 x double> %40, <2 x double>* %42, align 8
  ret void
}

The WordPress post says 'Julia doesn’t use SIMD by default. We can change this by using -O3. In the event (I don’t reproduce it here), this makes very little difference to performance.' For me, it does make quite a significant difference:

using BenchmarkTools, StaticArrays

@benchmark oneStep(h, q, p) setup = (h = 0.1; q = SVector(0.1,0.1); p = SVector(0.2,0.2))

results in 24.3 ns without -O3 and 15.7 ns with -O3 (the optimization level is fixed when julia starts, so the two numbers come from separate julia and julia -O3 sessions).

tkoolen commented 7 years ago

In https://idontgetoutmuch.wordpress.com/2017/06/02/1090/#comment-1153, he gets a ratio between 'real' times of 5.136 / 3.116 = 1.65 (not sure if 'real' is the right thing to look at here). The SIMD difference above is a ratio of 24.3 / 15.7 = 1.55, so we're almost there. Given that he used time, it's likely he's still counting compilation time for Julia.

Alternatively, the difference between user times (if, instead, those are the right thing to look at and he did use -O3 for Julia) is almost exactly the 1.56s constant term from his linear regression (compilation time).

Maybe I should stop guessing and actually install Haskell :-)
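
A rough way to check the compilation-time theory without installing Haskell (a sketch, assuming the code at the top of the thread is saved as a script, kepler.jl being a hypothetical name, with n = parse(Int, ARGS[1])): time a full run and a single-step run in fresh processes; the difference excludes startup and JIT.

# kepler.jl is hypothetical: the script from the top of the thread, reading n from ARGS[1].
# Subtracting the 1-step timing removes julia startup and compilation overhead.
for n in ("10000000", "1")
    @time run(`julia -O3 kepler.jl $n`)
end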

idontgetoutmuch commented 7 years ago

Do you think the version of Julia makes a difference? I am on

Version 0.5.2 (2017-05-06 16:34 UTC)
Official http://julialang.org/ release
x86_64-apple-darwin13.4.0

I did think SIMD should have roughly doubled performance.

tkoolen commented 7 years ago

I'm getting the exact same LLVM IR and benchmark results on 0.5.2 and 0.6.

idontgetoutmuch commented 7 years ago

Thank you for doing this. I can't explain my results in that case. I will add a comment to the post pointing to this thread.

jebej commented 7 years ago

I was curious, so I ran the benchmark above, and I also see a ~1.6x speedup with -O3: 14.6 ns vs 23.5 ns.