vchuravy opened this issue 8 years ago
So my question is: should `jl_is_bitstype` be true for `NTuple{N, VecElement}`?
`bitstype` is a Julia concept unrelated to what LLVM code we emit. I think it's okay to add `NTuple{N,VecElement}` as a special case in the places where that is useful for hand-written SIMD code.
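For context, a minimal sketch: a homogeneous tuple of `VecElement` is exactly what already maps to an LLVM vector type (here `<4 x float>`), even though `jl_is_bitstype` currently rejects it (the alias `Float32x4` is just for illustration):

```julia
# NTuple{4,VecElement{Float32}} lowers to the LLVM vector type <4 x float>.
const Float32x4 = NTuple{4,VecElement{Float32}}

v = Float32x4((VecElement(1.0f0), VecElement(2.0f0),
               VecElement(3.0f0), VecElement(4.0f0)))
```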
You can use the SIMD package to do this:
```julia
using SIMD
f(x) = x % Vec{4,Int8}
@code_llvm f(Vec{4,Int16}(1))
```
Unfortunately, SIMD currently scalarizes the code. Please open a bug report (or pull request...) if you are interested in improving this.
My goal is to give us a bit better support for SIMD in base Julia and to reduce the reliance on `llvmcall` for SIMD.jl.
How onerous would it be to make separate vector versions of all of the relevant intrinsics? I'm not completely opposed to making the existing intrinsics accept both `bitstype` and `NTuple{VecElement{bitstype}}`, but it seems hard to update all of the code (esp. the C implementations we have for all of the intrinsics) to accept them.
@vtjnash Isn't that exactly what the SIMD package already does? Before I implemented it, there was a discussion, and the conclusion was that new intrinsics were not necessary and that using `llvmcall` was easier, in particular since it didn't require changes to Base.

If there is a reason why intrinsics are to be preferred, e.g. if they are faster, then we should go that route. In that case, it would make sense to move the SIMD package into Base as well. Maybe we should move SIMD into Base first, and then work on intrinsics as a performance improvement?
As far as I know, the string form of `llvmcall` is seen as a hack that should be removed in the long term. One of its obvious limitations is that LLVM IR is not stable between LLVM versions. (Also, we can't really work with Julia structs...)
I personally think that having a great support story for SIMD types in Julia is important, and that it should have full language support. As you can see in #18470, the changes to Base are not as bad as I thought (except the runtime intrinsics), and `Core.Intrinsics` should give us enough basic support to implement most of SIMD.jl.
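For reference, a minimal sketch of the scalar case, which already works today; the vector case would pass `NTuple{N,VecElement{T}}` arguments to the same intrinsics:

```julia
# Scalar intrinsics can already be called directly from Julia.
Core.Intrinsics.add_int(Int64(1), Int64(2))   # 3
Core.Intrinsics.trunc_int(Int8, Int64(258))   # 2 (truncated to 8 bits)
```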
The big unknown issue with SIMD operations is how to represent vectors of booleans. Julia represents `Bool` essentially as a `UInt8` whose values are restricted to `0` or `1`. This is not efficient for SIMD operations. LLVM will often have to resort to scalarizing code, and thus all code using `ifelse` or similar constructs will run very slowly.
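To make the problem concrete, here is a hedged sketch of the kind of predicated code that suffers, assuming SIMD.jl's `vload`/`vstore`/`vifelse` API (the function name and loop shape are illustrative):

```julia
using SIMD

# A lane-wise comparison produces a vector of booleans; with Bool stored
# as a 0/1 byte, LLVM often scalarizes the ensuing select instead of
# emitting a single vector blend.
function clamp_negatives!(xs::Vector{Float64})
    V = Vec{4,Float64}
    @inbounds for i in 1:4:length(xs)-3
        v = vload(V, xs, i)
        mask = v < zero(V)                  # one boolean per lane
        vstore(vifelse(mask, zero(V), v), xs, i)
    end
    return xs
end
```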
The solution adopted by OpenCL (and many other packages) is to have several boolean types, essentially one boolean type for each integer or floating point size: `Bool8`, `Bool16`, `Bool32`, `Bool64`, with respective conversion operations. Also, it is common to use the values `0` and `-1` to represent false and true, and sometimes also to interpret all negative values as true when arbitrary integers are presented. This corresponds to hardware instructions on a wide range of architectures.
I'm bringing this up here since this is one of the big missing pieces in the SIMD package, and given that this is a large-ish change, it also deserves some discussion before going into Base Julia. However, given the state of Julia's and LLVM's optimizers, I don't see a way around introducing multiple boolean types if one wants to write efficient predicated SIMD code.
The select instruction in LLVM takes `<N x i1>` (http://llvm.org/docs/LangRef.html#select-instruction). How does that mesh with the hardware side? Supporting `i1` for masked load and select is on my list of things to add.
`<N x i1>` is only the representation for the LLVM instruction. E.g. Intel AVX2 expects `<N x i64>` for booleans if the other arguments to the select instruction are of type `i64`. (The rule of thumb for most architectures is that the number of bits in the boolean and the other arguments needs to be the same.)
If you store booleans in `<N x i64>`, and then truncate to `<N x i1>` right before the LLVM select instruction, LLVM will generate efficient code that skips the truncation. If, however, you store booleans in `<N x i1>` or `<N x i8>` when they are passed into a function, then LLVM will have to go to great lengths to convert to the `<N x i64>` expected by the instruction.
In SIMD.jl, I never store booleans in `<N x i1>`; instead, I only truncate to this type just before LLVM's select instruction.
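A minimal sketch of that pattern, assuming `Base.llvmcall` and 4 lanes of `Int64` (the names `vselect` and `V4I64` are made up for illustration):

```julia
const V4I64 = NTuple{4,VecElement{Int64}}

# Keep the mask as <4 x i64> (0 or -1 per lane) and truncate to <4 x i1>
# only at the select; LLVM folds the trunc+select into a single blend.
vselect(mask::V4I64, a::V4I64, b::V4I64) = Base.llvmcall("""
    %m = trunc <4 x i64> %0 to <4 x i1>
    %r = select <4 x i1> %m, <4 x i64> %1, <4 x i64> %2
    ret <4 x i64> %r
    """, V4I64, Tuple{V4I64,V4I64,V4I64}, mask, a, b)
```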
Thanks for the explanation; that clears up why I was seeing scalarized assembly when extracting a bitvector.
I was looking into getting Julia to emit the LLVM IR for a truncation between vector types, which should correspond to Julia code along the following lines:
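```julia
# Hedged sketch of the failing snippet (the exact code isn't shown here):
# truncate each Int16 lane to Int8 via the trunc_int intrinsic, which
# ought to lower to
#   %r = trunc <4 x i16> %x to <4 x i8>
f(x::NTuple{4,VecElement{Int16}}) =
    Core.Intrinsics.trunc_int(NTuple{4,VecElement{Int8}}, x)
```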
I traced it down to `staticeval_bitstype` (https://github.com/JuliaLang/julia/blob/fa4c02cb51cdb7eec7a4cc10ec0ed7d6a1e2c3ff/src/intrinsics.cpp#L388), and it seems that `jl_is_bitstype` is false for tuples of `VecElement` (because `nfields != 0`). So my question is: should `jl_is_bitstype` be true for `NTuple{N, VecElement}`? Since these lower to vector types in LLVM, that seems not unreasonable. Or should we special-case it in `staticeval_bitstype`?