vchuravy opened this issue 8 years ago
So my question is: should `jl_is_bitstype` be true for `NTuple{N, VecElement}`?
`bitstype` is a Julia concept unrelated to what LLVM code we emit. I think it's okay to add `NTuple{N,VecElement}` as a special case in the places where that is useful for hand-written SIMD code.
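For context, a minimal sketch: a homogeneous tuple of `VecElement` is exactly what already maps to an LLVM vector type (here `<4 x float>`), even though `jl_is_bitstype` currently rejects it (the alias `Float32x4` is just for illustration):

```julia
# NTuple{4,VecElement{Float32}} lowers to the LLVM vector type <4 x float>.
const Float32x4 = NTuple{4,VecElement{Float32}}

v = Float32x4((VecElement(1.0f0), VecElement(2.0f0),
               VecElement(3.0f0), VecElement(4.0f0)))
```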
You can use the SIMD package to do this:
```julia
using SIMD
f(x) = x % Vec{4,Int8}
@code_llvm f(Vec{4,Int16}(1))
```
Unfortunately, SIMD currently scalarizes the code. Please open a bug report (or pull request...) if you are interested in improving this.
My goal is to give us a bit better support for SIMD in base Julia and to reduce the reliance on `llvmcall` for SIMD.jl.
How onerous would it be to make separate vector versions of all of the relevant intrinsics? I'm not completely opposed to making the existing intrinsics accept both `bitstype` and `NTuple{VecElement{bitstype}}`, but it seems hard to update all of the code (esp. the C implementations we have for all of the intrinsics) to accept them.
@vtjnash Isn't that exactly what the SIMD package already does? Before I implemented it, there was a discussion, and the conclusion was that new intrinsics were not necessary and that using `llvmcall` was easier, in particular since it didn't require changes to Base.

If there is a reason why intrinsics are to be preferred, e.g. if they are faster, then we should go that route. In that case, it would make sense to move the SIMD package into Base as well. Maybe we should move SIMD into Base first, and then work on intrinsics as a performance improvement?
As far as I know, the string form of `llvmcall` is seen as a hack that should be removed in the long term. One of its obvious limitations is that LLVM IR is not stable between LLVM versions. (Also, we can't really work with Julia structs...)
I personally think that having a great support story for SIMD types in Julia is important, and that it should have full language support. As you can see in #18470, the changes to Base are not as bad as I thought (except the runtime intrinsics), and `Core.Intrinsics` should give us enough basic support to implement most of SIMD.jl.
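For reference, a minimal sketch of the scalar case, which already works today; the vector case would pass `NTuple{N,VecElement{T}}` arguments to the same intrinsics:

```julia
# Scalar intrinsics can already be called directly from Julia.
Core.Intrinsics.add_int(Int64(1), Int64(2))   # 3
Core.Intrinsics.trunc_int(Int8, Int64(258))   # 2 (truncated to 8 bits)
```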
The big unknown issue with SIMD operations is how to represent vectors of booleans. Julia represents `Bool` essentially as a `UInt8` whose values are restricted to `0` or `1`. This is not efficient for SIMD operations. LLVM will often have to resort to scalarizing code, and thus all code using `ifelse` or similar constructs will run very slowly.
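To make the problem concrete, here is a hedged sketch of the kind of predicated code that suffers, assuming SIMD.jl's `vload`/`vstore`/`vifelse` API (the function name and loop shape are illustrative):

```julia
using SIMD

# A lane-wise comparison produces a vector of booleans; with Bool stored
# as a 0/1 byte, LLVM often scalarizes the ensuing select instead of
# emitting a single vector blend.
function clamp_negatives!(xs::Vector{Float64})
    V = Vec{4,Float64}
    @inbounds for i in 1:4:length(xs)-3
        v = vload(V, xs, i)
        mask = v < zero(V)                  # one boolean per lane
        vstore(vifelse(mask, zero(V), v), xs, i)
    end
    return xs
end
```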
The solution adopted by OpenCL (and many other packages) is to have several boolean types, essentially one boolean type for each integer or floating point size: `Bool8`, `Bool16`, `Bool32`, `Bool64`, with respective conversion operations. Also, it is common to use the values `0` and `-1` to represent false and true, and sometimes also to interpret all negative values as true when arbitrary integers are presented. This corresponds to hardware instructions on a wide range of architectures.
I'm bringing this up here since this is one of the big missing pieces in the SIMD package, and given that this is a large-ish change, it also deserves some discussion before going into Base Julia. However, given the state of Julia's and LLVM's optimizers, I don't see a way around introducing multiple boolean types if one wants to write efficient predicated SIMD code.
The select instruction in LLVM takes `<N x i1>` (http://llvm.org/docs/LangRef.html#select-instruction). How does that mesh with the hardware side? Supporting `i1` for masked load and select is on my list of things to add.
`<N x i1>` is only the representation for the LLVM instruction. E.g. Intel AVX2 expects `<N x i64>` for booleans if the other arguments to the select instruction are of type `i64`. (The rule of thumb for most architectures is that the number of bits in the boolean and the other arguments needs to be the same.)
If you store booleans in `<N x i64>`, and then truncate to `<N x i1>` right before the LLVM select instruction, LLVM will generate efficient code that skips the truncation. If, however, you store booleans in `<N x i1>` or `<N x i8>` when they are passed into a function, then LLVM will have to go to great lengths to convert to the `<N x i64>` expected by the instruction.
In SIMD.jl, I never store booleans in `<N x i1>`; instead, I only truncate to this type just before LLVM's select instruction.
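A minimal sketch of that pattern, assuming `Base.llvmcall` and 4 lanes of `Int64` (the names `vselect` and `V4I64` are made up for illustration):

```julia
const V4I64 = NTuple{4,VecElement{Int64}}

# Keep the mask as <4 x i64> (0 or -1 per lane) and truncate to <4 x i1>
# only at the select; LLVM folds the trunc+select into a single blend.
vselect(mask::V4I64, a::V4I64, b::V4I64) = Base.llvmcall("""
    %m = trunc <4 x i64> %0 to <4 x i1>
    %r = select <4 x i1> %m, <4 x i64> %1, <4 x i64> %2
    ret <4 x i64> %r
    """, V4I64, Tuple{V4I64,V4I64,V4I64}, mask, a, b)
```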
Thanks for the explanation; that clears up why I was seeing scalarized assembly when extracting a bitvector.
I was looking into getting Julia to emit the LLVM IR for a truncation between vector types, which should correspond to Julia code along the following lines:
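```julia
# Hedged sketch of the failing snippet (the exact code isn't shown here):
# truncate each Int16 lane to Int8 via the trunc_int intrinsic, which
# ought to lower to
#   %r = trunc <4 x i16> %x to <4 x i8>
f(x::NTuple{4,VecElement{Int16}}) =
    Core.Intrinsics.trunc_int(NTuple{4,VecElement{Int8}}, x)
```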
I traced it down to `staticeval_bitstype` (https://github.com/JuliaLang/julia/blob/fa4c02cb51cdb7eec7a4cc10ec0ed7d6a1e2c3ff/src/intrinsics.cpp#L388), and it seems that `jl_is_bitstype` is false for tuples of `VecElement` (because `nfields != 0`). So my question is: should `jl_is_bitstype` be true for `NTuple{N, VecElement}`? Since these lower to vector types in LLVM, that seems not unreasonable. Or should we special-case it in `staticeval_bitstype`?