Closed: Chiil closed this issue 3 years ago
Bad performance: hitting a bunch of non-inlined fallback methods because of the rationals. I will basically have to redefine all of these fallback methods and mark them `@inline`.
...
Julia's inliner thinks `llvmcall` is very expensive, so it is unlikely to choose to inline a function using it. This forces basically all code using `llvmcall`, or calling functions that use `llvmcall`, to manually add `@inline` if the method is supposed to be inlined.
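For illustration, a minimal sketch of that pattern (the function names are made up for this example, not taken from VectorizationBase):

```julia
using Base: llvmcall

# Hypothetical example: a function whose body is an `llvmcall`. The inlining
# cost model treats `llvmcall` as very expensive, so without the explicit
# `@inline` Julia would be unlikely to inline it into callers.
@inline function add42(x::Int32)
    llvmcall("""
             %r = add i32 %0, 42
             ret i32 %r
             """, Int32, Tuple{Int32}, x)
end

# Functions calling it inherit the problem, so they need `@inline` too if
# they are supposed to be inlined.
@inline add84(x::Int32) = add42(add42(x))
```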
Deadlock: Not sure yet, but things seem to be getting corrupted.
Memory is getting corrupted because it blew ThreadingUtilities' buffers (corrupting its state). These buffers were blown because LoopVectorization tried to pass 170 `Rational{Int64}`s as function arguments.
`Float64`s wouldn't be passed as arguments at all (they would be inserted directly) and also wouldn't have caused the non-inlining bad-performance problem.
So I'm guessing that you weren't using `Rational`s before, and that both the deadlock and the bad performance started when you began using `Rational`s? A fix in LoopVectorization would be to check for them when handling constant literals and automatically convert them.
I switched to `Rational` for the reason explained in an earlier issue I reported: https://github.com/JuliaSIMD/LoopVectorization.jl/issues/320. I have code that can run in both `Float32` and `Float64`, and by using rationals I could (according to the Julia manual) avoid having to cast the literal constants from `Float64` to `Float32`. I followed the advice at the bottom of this page: https://docs.julialang.org/en/v1/manual/style-guide/
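As a concrete illustration of that advice (hypothetical one-line functions, not the kernel from this issue): a rational literal lets the same code stay in `Float32` or `Float64` depending on the input, while a `Float64` literal promotes `Float32` inputs.

```julia
# Hypothetical functions illustrating the style-guide advice on literals.
scale_rational(x) = (2 // 3) * x   # rational literal: result follows the type of x
scale_float(x)    = (2 / 3) * x    # Float64 literal: promotes Float32 input

typeof(scale_rational(1.0f0))  # Float32
typeof(scale_float(1.0f0))     # Float64
```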
Once I release LoopVectorization 0.12.75 (which should be in the next few hours), the examples of using rationals here will be fine. LoopVectorization should be able to convert `Float64` to `Float32` if it needs to, so that using `Float64` won't cause a problem.
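As a sketch of what that means in practice (a made-up kernel, not the one from this issue, and assuming the 0.12.75 behavior described above): a `Float32` loop can be written with a plain `Float64` literal. The REPL session below shows the same demotion at the `VectorizationBase` level.

```julia
using LoopVectorization

# Hypothetical kernel: the 0.25 literal is a Float64, but when combined with
# Float32 data it should be demoted to Float32 rather than promoting the
# whole computation to double precision.
function scale!(y::Vector{Float32}, x::Vector{Float32})
    @turbo for i in eachindex(x)
        y[i] = 0.25 * x[i]
    end
    return y
end
```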
julia> using VectorizationBase
julia> vx32 = Vec(ntuple(_ -> randn(Float32), pick_vector_width(Float32))...)
Vec{16, Float32}<-1.2874495f0, -0.5067953f0, 0.31757203f0, 1.067527f0, 0.3329838f0, 0.28481305f0, 0.49678314f0, -0.010635474f0, 0.5358165f0, 0.70434624f0, -1.0242392f0, -2.159845f0, 0.041450497f0, 1.5327083f0, -0.14729454f0, 0.3033066f0>
julia> vx32 * 34. # scalar `Float64` gets demoted to `Float32` when used with vector of `Float32`s.
Vec{16, Float32}<-43.77328f0, -17.23104f0, 10.797449f0, 36.29592f0, 11.321449f0, 9.683643f0, 16.890627f0, -0.36160612f0, 18.21776f0, 23.947773f0, -34.82413f0, -73.43473f0, 1.4093169f0, 52.112083f0, -5.008014f0, 10.312425f0>
julia> -1.287 * 34.
-43.757999999999996
Merging the lines above
@tturbo for k in ks:ke
    for j in js:je
        for i in is:ie
            @fd (ut, u, v, w) ut += (
                - gradx(interpx(u) * interpx(u)) + visc * (gradx(gradx(u))) )
            @fd (ut, u, v, w) ut += (
                - grady(interpx(v) * interpy(u)) + visc * (grady(grady(u))) )
            @fd (ut, u, v, w) ut += (
                - gradz(interpx(w) * interpz(u)) + visc * (gradz(gradz(u))) )
        end
    end
end
into
@tturbo for k in ks:ke
    for j in js:je
        for i in is:ie
            @fd (ut, u, v, w) ut += (
                - gradx(interpx(u) * interpx(u)) + visc * (gradx(gradx(u)))
                - grady(interpx(v) * interpy(u)) + visc * (grady(grady(u)))
                - gradz(interpx(w) * interpz(u)) + visc * (gradz(gradz(u))) )
        end
    end
end
Results in a deadlock on 4 threads again in the most recent version.
Are you sure it's deadlocking? I tried and it worked, BUT it did take obscenely long to compile:
julia> @time kernel!(
ut, u, v, w,
visc, dxi, dyi, dzi, dt,
is, ie, js, je, ks, ke)
151.048107 seconds (172.85 M allocations: 7.728 GiB, 11.76% gc time, 99.73% compilation time)
The reason is that LV decided, for some reason, to do something wonky when optimizing the second version compared to the first. EDIT: it was deciding to do something wonky because of a bug.
I reran an old benchmark that used to work with LoopVectorization, but it now gives very slow performance on one thread and deadlocks on four threads. I filed multiple bugs on earlier versions of this before, and at one point everything worked, but now I am back at a non-working version of my benchmark. This is the code. I also have a slightly modified version in which I put the rational constants into global constants, and then it works fine (see the sketch below).
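For reference, a minimal sketch of that workaround (hypothetical names and coefficients, not the actual benchmark kernel): hoisting the rational constants out of the loop body into `const` globals so they are no longer `Rational{Int64}` literals inside the `@tturbo` expression.

```julia
using LoopVectorization

# Hypothetical coefficients standing in for the rational finite-difference
# constants that were previously written inline in the loop body.
const C1 = 9 // 8
const C2 = 1 // 24

# Hypothetical kernel using the const globals instead of inline rational
# literals; per the comment above, this variant runs fine.
function tendency!(ut, u, v, is, ie)
    @tturbo for i in is:ie
        ut[i] += C1 * u[i] - C2 * v[i]
    end
    return ut
end
```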