JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Vectorization of field access #69

Open cdsousa opened 6 years ago

cdsousa commented 6 years ago

The code

using CUDAdrv, CUDAnative

k(a,b) = (@inbounds a[1] = b[1]; nothing)

t = CuArray([(0x0,0x0)])
@device_code_sass @cuda k(t, t)

, as well as

struct AAA; x::UInt8; y::UInt8; end
s = CuArray([AAA(0x0,0x0)])
@device_code_sass @cuda k(s,s)

, generates two load and two store instructions, regardless of the data type used (int, float, etc.):

...
        /*0028*/                   LDG.E.U8 R7, [R2+0x1];         /* 0xeed0200000170207 */
        /*0030*/                   LDG.E.U8 R6, [R2];             /* 0xeed0200000070206 */
...
        /*0058*/                   STG.E.U8 [R4], R7;             /* 0xeed8200000070407 */
                                                                  /* 0x001ffc00ffe081f1 */
        /*0068*/                   STG.E.U8 [R4+-0x1], R6;        /* 0xeed82ffffff70406 */
        ...

It would be amazing if the compiler could optimize that to use vectorized memory access in both cases (https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/; update: this link is not quite about what I'm referring to, see the comments below).

cdsousa commented 6 years ago

Let me add that I don't know whether this is a doable feature (or even a desired one). However, if I remember correctly, Transpiler.jl for OpenCL has it.

maleadt commented 6 years ago

This optimization is impossible, since it affects the way you execute the kernel: if you were to perform a vectorized load/store, you'd need to launch fewer threads, because each thread would process multiple items.

So either way, with our programming model you'd need to express this explicitly, for example by launching the following kernel with half as many threads:

k(a,b) = (@inbounds a[1] = b[1]; @inbounds a[2] = b[2]; nothing)
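
As a hedged sketch (not code from this thread; the kernel name, sizes, and keyword launch syntax are illustrative assumptions), the explicit scheme for a whole-array copy could look like this, with each thread handling two consecutive elements and only half as many threads being launched:

using CUDAdrv, CUDAnative

function copy2!(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds begin
        a[2i - 1] = b[2i - 1]  # each thread copies two consecutive elements...
        a[2i]     = b[2i]
    end
    return nothing
end

n = 1024
a = CuArray(zeros(Float32, n))
b = CuArray(ones(Float32, n))
@cuda threads=n÷2 copy2!(a, b)  # ...so only half as many threads are needed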

But even that two-element kernel doesn't vectorize:

        /*0048*/                   ST.E.U8 [R4+0x1], R0;   /* 0xe0800000009c1000 */
        /*0050*/                   ST.E.U8 [R4], R3;       /* 0xe0800000001c100c */
        /*0058*/                   LD.E.U8 R2, [R6+0x3];   /* 0xc0800000019c1808 */
        /*0060*/                   LD.E.U8 R0, [R6+0x2];   /* 0xc0800000011c1800 */
        /*0068*/                   ST.E.U8 [R4+0x3], R2;   /* 0xe0800000019c1008 */
        /*0070*/                   ST.E.U8 [R4+0x2], R0;   /* 0xe0800000011c1000 */

And that is something worth looking into.

maleadt commented 6 years ago

LLVM has enabled the load/store vectorizer for NVPTX in D22592 (4.0), so let's just revisit this when JuliaLang/julia#26398 lands.

cdsousa commented 6 years ago

@maleadt, I didn't fully understand your first comment.

I'm referring to loads and stores of types (structs and ntuples) that are composed of smaller types and could be accessed together as an int2, int4, float4, etc. I'm not referring to auto-vectorization of accesses to array elements (in the style of SIMD...).

When one reads an element of an array of NTuple{4, Float32}s, one wants all 4 Float32s, so the load could possibly be done as a CUDA float4...
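
For example, a hedged variant of the example at the top of this issue, only with a wider element type (the hoped-for result being a single 128-bit float4 load and store):

using CUDAdrv, CUDAnative

k(a, b) = (@inbounds a[1] = b[1]; nothing)

v = CuArray([(1f0, 2f0, 3f0, 4f0)])  # element type NTuple{4,Float32}
@device_code_sass @cuda k(v, v)      # presumably four 32-bit loads/stores today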

maleadt commented 6 years ago

Oh right, I thought you were referring to the optimization as explained in that blog post, which involves doing manual vector loads and changing the launch configuration. Should have read more carefully.

But the example I posted, doing multiple loads manually, corresponds to your examples loading from tuples or structs. So the conclusion remains: LLVM 6.0 should improve this.

cdsousa commented 6 years ago

Ok, my bad, indeed the blog post I linked is not about what I wanted.

maleadt commented 6 years ago

OK, some notes to self: the ld/st vectorizer seems to work, but requires an alignment of >=4 (int8 loads are align 1 right now).

source_filename = "wip"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @kernel(i8 addrspace(1)* nocapture nonnull dereferenceable(1)) local_unnamed_addr {
top:
  %1 = getelementptr i8, i8 addrspace(1)* %0, i64 1
  %2 = getelementptr i8, i8 addrspace(1)* %0, i64 2

  store i8 0, i8 addrspace(1)* %1, align 4
  store i8 0, i8 addrspace(1)* %2, align 4

  ret void
}

But the vectorizer doesn't seem to run with opt -O3, even though it should be added by NVPTX's addIRPasses. Maybe that's why it also didn't seem to vectorize with our pass config.

using CUDAnative

k(a) = (a[1]=0; a[2]=0; nothing)
CUDAnative.code_llvm(k, Tuple{CuDeviceArray{Int8,1,AS.Global}}; dump_module=true)

maleadt commented 6 years ago

Apparently the vectorizer also doesn't like AS casts:

source_filename = "wip"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @foo(i8*) {
entry:
  %1 = getelementptr i8, i8* %0, i64 1
  %2 = addrspacecast i8* %1 to i8 addrspace(1)*
  %3 = getelementptr i8, i8* %0, i64 2
  %4 = addrspacecast i8* %3 to i8 addrspace(1)*
  store i8 0, i8 addrspace(1)* %2, align 4
  store i8 0, i8 addrspace(1)* %4, align 4
  ret void
}

define void @bar(i8 addrspace(1)*) {
entry:
  %1 = getelementptr i8, i8 addrspace(1)* %0, i64 1
  %2 = getelementptr i8, i8 addrspace(1)* %0, i64 2
  store i8 0, i8 addrspace(1)* %1, align 4
  store i8 0, i8 addrspace(1)* %2, align 4
  ret void
}

maleadt commented 6 years ago

OK, so this turns out to be much more complicated than initially expected, because LLVM treats AS casts as black boxes. Patching the LSV wouldn't help; this is a deeper problem: SCEV doesn't look past AS casts. For example:

define void @vectorizes(i8 addrspace(1)*) {
entry:
  %1 = getelementptr i8, i8 addrspace(1)* %0, i64 1
  store i8 0, i8 addrspace(1)* %1, align 4

  %2 = getelementptr i8, i8 addrspace(1)* %0, i64 2
  store i8 0, i8 addrspace(1)* %2, align 4

  ret void
}

; LSV: isConsecutiveAccess check for
;  store i8 0, i8 addrspace(1)* %1, align 4
;  store i8 0, i8 addrspace(1)* %2, align 4
; SCEV ptrA: (1 + %0)
; SCEV ptrB: (2 + %0)

define void @doesnt_vectorize(i8*) {
entry:
  %1 = getelementptr i8, i8* %0, i64 1
  %2 = addrspacecast i8* %1 to i8 addrspace(1)*
  store i8 0, i8 addrspace(1)* %2, align 4

  %3 = getelementptr i8, i8* %0, i64 2
  %4 = addrspacecast i8* %3 to i8 addrspace(1)*
  store i8 0, i8 addrspace(1)* %4, align 4

  ret void
}

; LSV: isConsecutiveAccess check
;  store i8 0, i8 addrspace(1)* %2, align 4
;  store i8 0, i8 addrspace(1)* %4, align 4
; SCEV ptrA: %2
; SCEV ptrB: %4

Highly related: https://reviews.llvm.org/D23749

@cdsousa: are you running into actual perf issues where vectorizing instructions would be beneficial? I'd imagine this to happen with compute-bound workloads, which tend to be pretty rare...

cdsousa commented 6 years ago

@maleadt, actually no, although I haven't tested it completely yet. I'm assessing how well Julia code for wrapping an RGB image translates to CUDAnative.jl, and how well it gets optimized. I was hoping to use a high-level RGB{N0f8} type (ColorTypes.jl, FixedPointNumbers.jl) and get all the optimizations for free...

SimonDanisch commented 6 years ago

Btw, I simply implemented this by overloading getindex to directly call the vector load/store intrinsic:

Base.getindex(a::GlobalPointer{T}, i::Integer) where {T <: VecTypes} = vload(T, a, i)
function Base.setindex!(a::GlobalPointer{T}, value::T, i::Integer) where {T <: VecTypes}
    vstore(value, a, i) # I also overloaded vstore/vload to call the correct <N x T> intrinsic
end
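
For reference, a hedged sketch of what such a vector load intrinsic can look like when expressed with llvmcall, in the style of SIMD.jl; the names vload4 and Float32x4 are made up for illustration, and an actual GPU version would additionally have to target the global address space:

using Base: llvmcall

const Float32x4 = NTuple{4,VecElement{Float32}}

# Load four consecutive Float32s with a single <4 x float> vector load.
# The Ptr argument arrives as an i64 and is cast to a vector pointer;
# 16-byte alignment of the source is assumed here.
vload4(p::Ptr{Float32}) = llvmcall("""
    %ptr = inttoptr i64 %0 to <4 x float>*
    %val = load <4 x float>, <4 x float>* %ptr, align 16
    ret <4 x float> %val
    """, Float32x4, Tuple{Ptr{Float32}}, p)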

maleadt commented 6 years ago

Yeah, this is a more generic approach though, to vectorize arbitrary memory operations, including non-VecTypes structs, or even stack values.

@cdsousa it doesn't look like clang or nvcc is smart enough to optimize your examples either; they have the added difficulty that the alignment of the individual memory operations is too small to enable vectorization. I might have a look, but given the difficulty and the presumably low payoff this isn't a priority.

maleadt commented 6 years ago

OK, some final reflections on what would need to happen to make this work.

All examples to be executed with LSV+NVPTX, i.e. opt -load-store-vectorizer - | llvm-dis -o -, with

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

1. Make LLVM's LSV peek through AS casts

define void @yes(i8 addrspace(1)*) {
entry:
  %1 = getelementptr i8, i8 addrspace(1)* %0, i64 1
  %2 = getelementptr i8, i8 addrspace(1)* %0, i64 2
  store i8 0, i8 addrspace(1)* %1, align 2
  store i8 0, i8 addrspace(1)* %2, align 1
  ret void
}

define void @no(i8*) {
entry:
  %1 = getelementptr i8, i8* %0, i64 1
  %2 = addrspacecast i8* %1 to i8 addrspace(1)*
  %3 = getelementptr i8, i8* %0, i64 2
  %4 = addrspacecast i8* %3 to i8 addrspace(1)*
  store i8 0, i8 addrspace(1)* %2, align 2
  store i8 0, i8 addrspace(1)* %4, align 1
  ret void
}

Ref https://reviews.llvm.org/D23749

2. Make Julia's getfield emit better alignment

emit_getfield_knownidx already uses known datatype alignment, but jl_field_align seems pessimistic for aggregates:

julia> datatype_align(Int32)
0x0004

julia> datatype_align(Tuple{Int32,Int32})
0x0004

julia> struct Foo
       x::Int32
       y::Int32
       end

julia> datatype_align(Foo)
0x0004

julia> @code_llvm ((x)->(x[1]+x[2]))((Int32(1),Int32(2)))

; Function #5
; Location: REPL[8]:1
define i32 @"julia_#5_35105"([2 x i32] addrspace(11)* nocapture nonnull readonly dereferenceable(8)) {
top:
; Function getindex; {
; Location: tuple.jl:22
  %1 = getelementptr [2 x i32], [2 x i32] addrspace(11)* %0, i64 0, i64 0
  %2 = getelementptr [2 x i32], [2 x i32] addrspace(11)* %0, i64 0, i64 1
;}
; Function +; {
; Location: int.jl:53
  %3 = load i32, i32 addrspace(11)* %1, align 4
  %4 = load i32, i32 addrspace(11)* %2, align 4
  %5 = add i32 %4, %3
;}
  ret i32 %5
}

I would have assumed these to be aligned to a multiple of the datatype size; this is also what the PTX ISA seems to require.

cdsousa commented 6 years ago

With the change of the issue title, is tuple access still included?

maleadt commented 6 years ago

Yes.

maleadt commented 6 years ago

Actually, our datatype alignment isn't really pessimistic here; it's apparently valid (and mimics C) to align e.g. a tuple of two 1-byte values to a single byte (thanks @vchuravy). Which means that the transformation you proposed isn't valid, and automatic vectorization isn't possible in this case...
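
A hedged REPL illustration of that, using Base's internal datatype_alignment helper on a recent Julia (names and exact output may differ across versions):

julia> Base.datatype_alignment(Tuple{UInt8,UInt8})
1

julia> Base.datatype_alignment(UInt16)
2

So consecutive Tuple{UInt8,UInt8} elements are only guaranteed 1-byte alignment, and a combined 16-bit access cannot be assumed to be aligned.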

One way forward would be to redefine our alignment rules, to make automatic vectorization possible in cases like this. But that's a Julia issue; I'd rather not special-case alignment in CUDAnative (even though it would be legal for PTX). Maybe it'll happen in the wake of JuliaLang/julia#21959...

I'll leave this open because it's an interesting issue, but don't expect much to happen because as far as I am concerned it is blocked on an "upstream" issue.

cdsousa commented 4 years ago

For a proof of concept of explicit vectorization: https://github.com/JuliaGPU/CUDAnative.jl/issues/566#issuecomment-626219451