c42f opened 7 years ago
I thought the most likely explanation is as outlined on my blog:
Now it is entirely possible that this results in usage of different libms, the Julia calling openlibm and GHC calling the system libm which on my machine is the one that comes with Mac OS X and is apparently quite a lot faster. We could try compiling the actual LLVM and replacing the Julia calls with pow but maybe that is the subject for another blog post.
http://info.prelert.com/blog/os-x-maths-functions-are-the-fastest
With the julia implementation above, there are no calls to libm from within oneStep, as I expressed the power as norm(q)^3 (as in the comment on your blog post). Here's the LLVM IR:
julia> @code_llvm oneStep(0.1, SVector(0.1,0.1), SVector(0.2,0.2))
define void @julia_oneStep_61006([2 x %SArray]* noalias nocapture sret, double, %SArray* nocapture readonly dereferenceable(16), %SArray* nocapture readonly dereferenceable(16)) #0 !dbg !5 {
pass11:
%4 = getelementptr inbounds %SArray, %SArray* %2, i64 0, i32 0, i64 0
%5 = getelementptr inbounds %SArray, %SArray* %2, i64 0, i32 0, i64 1
%6 = load double, double* %4, align 8
%7 = load double, double* %5, align 8
%8 = fmul double %6, %6
%9 = fmul double %7, %7
%10 = fadd double %8, %9
%11 = call double @llvm.sqrt.f64(double %10)
%12 = fmul double %11, %11
%13 = fmul double %11, %12
%14 = fdiv double %6, %13
%15 = fdiv double %7, %13
%16 = fmul double %1, 5.000000e-01
%17 = fmul double %16, %14
%18 = fmul double %16, %15
%19 = getelementptr inbounds %SArray, %SArray* %3, i64 0, i32 0, i64 0
%20 = load double, double* %19, align 8
%21 = fsub double %20, %17
%22 = getelementptr inbounds %SArray, %SArray* %3, i64 0, i32 0, i64 1
%23 = load double, double* %22, align 8
%24 = fsub double %23, %18
%25 = fmul double %21, %1
%26 = fmul double %24, %1
%27 = fadd double %6, %25
%28 = fadd double %7, %26
%29 = fmul double %27, %27
%30 = fmul double %28, %28
%31 = fadd double %29, %30
%32 = call double @llvm.sqrt.f64(double %31)
%33 = fmul double %32, %32
%34 = fmul double %32, %33
%35 = fdiv double %27, %34
%36 = fdiv double %28, %34
%37 = fmul double %16, %35
%38 = fmul double %16, %36
%39 = fsub double %21, %37
%40 = fsub double %24, %38
%41 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 0, i32 0, i64 0
store double %27, double* %41, align 8
%.sroa.235.0..sroa_idx36 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 0, i32 0, i64 1
store double %28, double* %.sroa.235.0..sroa_idx36, align 8
%.sroa.3.0..sroa_idx37 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 0
store double %39, double* %.sroa.3.0..sroa_idx37, align 8
%.sroa.4.0..sroa_idx38 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 1
store double %40, double* %.sroa.4.0..sroa_idx38, align 8
ret void
}
The @llvm.sqrt.f64 intrinsic gets compiled down to a call to sqrtsd on my local machine. In the comments section of your post it was reported that this helped in both cases, as it should, but the 2x difference remained, which seemed confusing and worth thinking about :-)
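As an aside, a quick way to see why writing the power as norm(q)^3 avoids libm is to compare the IR Julia generates for an integer versus a floating-point exponent; the function names below are purely illustrative:

cube_int(x)   = x^3     # integer exponent: no libm pow call, just multiplications in the IR
cube_float(x) = x^3.0   # Float64 exponent: goes through the libm/openlibm pow routine

# julia> @code_llvm cube_int(2.0)
# julia> @code_llvm cube_float(2.0)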
Yes, this is extremely interesting. I did read the blog post and was very curious. (Sorry I don't have anything intelligent to say at the moment! But thanks Chris for taking a stab).
SIMD? With -O3 --check-bounds=no (the latter of which doesn't seem to matter) I get:
define void @julia_oneStep_61312([2 x %SArray]* noalias nocapture sret, double, %SArray* nocapture readonly dereferenceable(16), %SArray* nocapture readonly dereferenceable(16)) #0 !dbg !5 {
pass15:
%4 = bitcast %SArray* %2 to <2 x double>*
%5 = load <2 x double>, <2 x double>* %4, align 8
%6 = extractelement <2 x double> %5, i32 0
%7 = fmul double %6, %6
%8 = extractelement <2 x double> %5, i32 1
%9 = fmul double %8, %8
%10 = fadd double %7, %9
%11 = call double @llvm.sqrt.f64(double %10)
%12 = fmul double %11, %11
%13 = fmul double %11, %12
%14 = insertelement <2 x double> undef, double %13, i32 0
%15 = insertelement <2 x double> %14, double %13, i32 1
%16 = fdiv <2 x double> %5, %15
%17 = fmul double %1, 5.000000e-01
%18 = insertelement <2 x double> undef, double %17, i32 0
%19 = insertelement <2 x double> %18, double %17, i32 1
%20 = fmul <2 x double> %19, %16
%21 = bitcast %SArray* %3 to <2 x double>*
%22 = load <2 x double>, <2 x double>* %21, align 8
%23 = fsub <2 x double> %22, %20
%24 = insertelement <2 x double> undef, double %1, i32 0
%25 = insertelement <2 x double> %24, double %1, i32 1
%26 = fmul <2 x double> %25, %23
%27 = fadd <2 x double> %5, %26
%28 = extractelement <2 x double> %27, i32 0
%29 = fmul double %28, %28
%30 = extractelement <2 x double> %27, i32 1
%31 = fmul double %30, %30
%32 = fadd double %29, %31
%33 = call double @llvm.sqrt.f64(double %32)
%34 = fmul double %33, %33
%35 = fmul double %33, %34
%36 = insertelement <2 x double> undef, double %35, i32 0
%37 = insertelement <2 x double> %36, double %35, i32 1
%38 = fdiv <2 x double> %27, %37
%39 = fmul <2 x double> %19, %38
%40 = fsub <2 x double> %23, %39
%41 = bitcast [2 x %SArray]* %0 to <2 x double>*
store <2 x double> %27, <2 x double>* %41, align 8
%.sroa.3.0..sroa_idx41 = getelementptr inbounds [2 x %SArray], [2 x %SArray]* %0, i64 0, i64 1, i32 0, i64 0
%42 = bitcast double* %.sroa.3.0..sroa_idx41 to <2 x double>*
store <2 x double> %40, <2 x double>* %42, align 8
ret void
}
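As a further check, one can also look at the native assembly to confirm that the packed <2 x double> operations survive to machine code; with -O3 it should show packed instructions such as mulpd/subpd rather than only the scalar mulsd/subsd forms:

# julia> @code_native oneStep(0.1, SVector(0.1,0.1), SVector(0.2,0.2))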
The WordPress post says 'Julia doesn’t use SIMD by default. We can change this by using -O3. In the event (I don’t reproduce it here), this makes very little difference to performance.' For me, it does make quite a significant difference:
@benchmark oneStep(h, q, p) setup = (h = 0.1; q = SVector(0.1,0.1); p = SVector(0.2,0.2))
results in 24.3 ns without -O3 and 15.7 ns with -O3.
In https://idontgetoutmuch.wordpress.com/2017/06/02/1090/#comment-1153, he gets a ratio between 'real' times of 5.136 / 3.116 = 1.65 (not sure if 'real' is the right thing to look at here). The SIMD difference above is a ratio of 24.3 / 15.7 = 1.55, so almost there. Given that he used time, it's likely that he's still counting compilation time for Julia.
Alternatively, the difference between user times (if, instead, those are the right thing to look at and he did use -O3 for Julia) is almost exactly the 1.56s constant term from his linear regression (compilation time).
Maybe I should stop guessing and actually install Haskell :-)
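As an aside, the compilation-time effect is easy to see directly: the first call of a Julia function for a given combination of argument types includes JIT compilation, while later calls do not, so a single wall-clock measurement of a short run penalizes Julia. A minimal sketch, assuming the oneStep definition from the original post:

using StaticArrays

h, q, p = 0.1, SVector(0.1, 0.1), SVector(0.2, 0.2)

@time oneStep(h, q, p)   # first call: includes JIT compilation of oneStep
@time oneStep(h, q, p)   # second call: steady-state cost only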
Do you think the version of Julia makes a difference? I am on
Version 0.5.2 (2017-05-06 16:34 UTC)
Official http://julialang.org/ release
x86_64-apple-darwin13.4.0
I did think SIMD should have roughly doubled performance.
I'm getting the exact same LLVM IR and benchmark results on 0.5.2 and 0.6.
Thank you for doing this. I can't explain my results in that case. I will add a comment to the post pointing to this thread.
I was curious, so I also ran the benchmark above. I also get a ~1.6x speedup with -O3: 14.6 ns vs 23.5 ns.
As mentioned in passing on julialang slack, the following is an interesting and carefully considered blog post by @idontgetoutmuch, mainly about Haskell, but with an addendum comparing the speed of the resulting code to a julia implementation using StaticArrays:
https://idontgetoutmuch.wordpress.com/2017/06/02/1090/
It's a lot of LLVM assembly, but it's intriguing that there's possibly still a 2x speedup or so that we might be missing out on, even for the following simple code (my interpretation of @idontgetoutmuch's implementation in julia idiom, possibly with added bugs):
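The original post's code block is not reproduced here; the following is only a rough sketch of what such an oneStep implementation could look like, pieced together from the LLVM IR quoted in the comments (one leapfrog / Störmer–Verlet step for the Kepler problem). The variable names and the return container are guesses, not the post's actual code:

using StaticArrays
using LinearAlgebra   # for norm; it lived in Base on the Julia 0.5/0.6 used in this thread

# Sketch only.  One leapfrog step for H(q, p) = |p|^2/2 - 1/|q|,
# i.e. acceleration -q/|q|^3.
function oneStep(h, q, p)
    h2    = h / 2
    pHalf = p - h2 * q / norm(q)^3            # half kick
    qNew  = q + h * pHalf                     # drift
    pNew  = pHalf - h2 * qNew / norm(qNew)^3  # half kick
    return SVector(qNew, pNew)                # matches the [2 x SArray] return slot in the IR
end

Running @code_llvm on this under -O3 should give IR resembling the second listing above.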
I tried a lot of tweaks to get the 2x speedup, including running with julia -O3 --math-mode=fast, and writing the oneStep() function in vectorized and non-vectorized LLVM assembly by hand. None of these made things appreciably faster than just using the version above with julia-0.6. I'd probably need to actually benchmark the Haskell version locally to make any more progress...