chriselrod opened this issue 3 years ago
The first part is an LLVM issue and should be discussed there. I feel like it's illegal, though. And the biggest obstacle to representing scalable vectors in Julia code is that the size is unknown. There are a lot of places that assume the size of a type is known at compile time, or else it'll be heap allocated. That's the main reason I did not implement it 4-5 years ago.
> The first part is an LLVM issue and should be discussed there. I feel like it's illegal, though.
Would the llvm-dev mailing list be the best place to ask?
> And the biggest obstacle to representing scalable vectors in Julia code is that the size is unknown. There are a lot of places that assume the size of a type is known at compile time, or else it'll be heap allocated. That's the main reason I did not implement it 4-5 years ago.
Yeah, I think it'd need special handling / a representation like it has in LLVM IR.
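The fixed-size assumption is also what makes the existing SIMD mapping work. A minimal illustration (nothing here is SVE-specific):

# NTuple{N,Core.VecElement{T}} lowers to LLVM's <N x T>; its size is a
# compile-time constant, so the type is isbits and can live in registers
# or on the stack.
v = ntuple(i -> Core.VecElement(Float64(i)), 8)  # an <8 x double>
sizeof(typeof(v))      # 64: 8 * sizeof(Float64), known at compile time
isbitstype(typeof(v))  # true
# A <vscale x 2 x double> has no compile-time sizeof, so under the current
# rules it couldn't be an isbits Julia type at all.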
https://lists.llvm.org/pipermail/llvm-dev/2021-April/149612.html suggested specifying -aarch64-sve-vector-bits-min=, which does work: https://godbolt.org/z/Mo76oWanW
Aside from the option not yet existing in LLVM 11, we'd need a way to set this correctly (and preferably automatically) when starting Julia.
This sounds (again) like something best handled on the LLVM side of things, e.g. a -aarch64-sve-vector-bits=native option. Although, were something like that to be added, it wouldn't be available until at least LLVM 13.
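In the meantime, here is a hedged sketch of how a wrapper could discover the native SVE vector length on Linux via prctl(2) before starting Julia (PR_SVE_GET_VL and the length mask are taken from <linux/prctl.h>; the helper name is made up):

const PR_SVE_GET_VL = 51          # from <linux/prctl.h>
const PR_SVE_VL_LEN_MASK = 0xffff

# Returns the SVE vector length in bits, or nothing if SVE is unavailable.
function sve_vector_bits()
    vl = ccall(:prctl, Cint, (Cint, Culong, Culong, Culong, Culong),
               PR_SVE_GET_VL, 0, 0, 0, 0)
    vl < 0 ? nothing : 8 * (vl & PR_SVE_VL_LEN_MASK)
end

# e.g. JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=$(sve_vector_bits())"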
After playing a bit with Chris on an a64fx cluster with Julia v1.7-beta3 (which comes with LLVM 12):
$ julia -q
julia> using BenchmarkTools
julia> function sumsimd(x)
s = zero(eltype(x))
@simd for xi in x
s += xi
end
s
end
sumsimd (generic function with 1 method)
julia> @btime sumsimd(x) setup=(x = rand(1_000_000))
643.256 μs (0 allocations: 0 bytes)
500273.11451950937
$ JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" julia -q
julia> using BenchmarkTools
julia> function sumsimd(x)
s = zero(eltype(x))
@simd for xi in x
s += xi
end
s
end
sumsimd (generic function with 1 method)
julia> @btime sumsimd(x) setup=(x = rand(1_000_000))
185.212 μs (0 allocations: 0 bytes)
500240.6910755522
It's sufficient to set JULIA_LLVM_ARGS appropriately to get a nice 3.5x boost.
@giordano Have you also checked what happens if you use Float32 or Float16? i.e.
julia> @btime sumsimd(x) setup=(x = rand(Float32, 1_000_000))
julia> @btime sumsimd(x) setup=(x = rand(Float16, 1_000_000))
whereas for the latter I assume Float16 support must be enabled (#40216), which I assume is on by default only from 1.8 onwards.
Yep, I did in https://github.com/UoB-HPC/BabelStream/pull/106#discussion_r697861796:
$ JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" julia -q
julia> using BenchmarkTools
julia> function sumsimd(x)
s = zero(eltype(x))
@simd for xi in x
s += xi
end
s
end
sumsimd (generic function with 1 method)
julia> @btime sumsimd(x) setup=(x = randn(Float64, 1_000_000))
191.912 μs (0 allocations: 0 bytes)
1853.0335322487956
julia> @btime sumsimd(x) setup=(x = randn(Float32, 1_000_000))
80.330 μs (0 allocations: 0 bytes)
400.9806f0
julia> @btime sumsimd(x) setup=(x = randn(Float16, 1_000_000))
42.761 μs (0 allocations: 0 bytes)
Float16(1.872e3)
@chriselrod
> However, early tests using LLVM vector intrinsics on the A64FX did not go well. Here is a minimal example on Godbolt, showing a vectorized (but not unrolled) dot product on the A64FX, which has 512-bit vectors. The problem is that <8 x double> gets translated into 4x <2 x double> NEON instructions, instead of an SVE instruction. v registers are NEON, and we can see that the single @llvm.fma.v8f64 was broken up into 4 separate fmla instructions. Based on this document, SVE registers would be denoted by z[0-31].
Using llc trunk (though v13 should already be enough) with -march=aarch64 -mcpu=a64fx -aarch64-sve-vector-bits-min=512: https://godbolt.org/z/ovxhr933G. I think it looks much better?
.LBB0_2: // %L34
ld1d { z1.d }, p0/z, [x13]
ld1d { z2.d }, p0/z, [x14]
add x10, x10, #8
add x14, x14, #64
add x13, x13, #64
fmla z0.d, p0/m, z1.d, z2.d
cmp x10, x12
b.le .LBB0_2
Yeah, that looks good.
The good news is that Julia nightly
Julia Version 1.9.0-DEV.809
Commit 9b83dd8920 (2022-06-19 19:31 UTC)
Platform Info:
OS: Linux (aarch64-unknown-linux-gnu)
CPU: 48 × unknown
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.3 (ORCJIT, a64fx)
Threads: 1 on 48 virtual cores
generates the code (see below) for the sumsimd function above with the new registers, without having to use JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" (an option which is a massive pain, because you'd run into all sorts of crashes every now and then, like #44401 and #44263).
julia> @code_llvm debuginfo=:none sumsimd(randn(Float64, 1_000_000))
define double @julia_sumsimd_827({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
%.not = icmp eq i64 %3, 0
br i1 %.not, label %L17, label %L10.lr.ph
L10.lr.ph: ; preds = %top
%4 = bitcast {}* %0 to double**
%5 = load double*, double** %4, align 8
%6 = call i64 @llvm.vscale.i64()
%7 = shl i64 %6, 3
%min.iters.check = icmp ult i64 %3, %7
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %L10.lr.ph
%n.mod.vf = urem i64 %3, %7
%n.vec = sub nsw i64 %3, %n.mod.vf
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <vscale x 2 x double> [ insertelement (<vscale x 2 x double> zeroinitializer, double 0.000000e+00, i32 0), %vector.ph ], [ %23, %vector.body ]
%vec.phi9 = phi <vscale x 2 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
%vec.phi10 = phi <vscale x 2 x double> [ zeroinitializer, %vector.ph ], [ %25, %vector.body ]
%vec.phi11 = phi <vscale x 2 x double> [ zeroinitializer, %vector.ph ], [ %26, %vector.body ]
%8 = getelementptr inbounds double, double* %5, i64 %index
%9 = bitcast double* %8 to <vscale x 2 x double>*
%wide.load = load <vscale x 2 x double>, <vscale x 2 x double>* %9, align 8
%10 = call i32 @llvm.vscale.i32()
%11 = shl i32 %10, 1
%12 = sext i32 %11 to i64
%13 = getelementptr inbounds double, double* %8, i64 %12
%14 = bitcast double* %13 to <vscale x 2 x double>*
%wide.load12 = load <vscale x 2 x double>, <vscale x 2 x double>* %14, align 8
%15 = shl i32 %10, 2
%16 = sext i32 %15 to i64
%17 = getelementptr inbounds double, double* %8, i64 %16
%18 = bitcast double* %17 to <vscale x 2 x double>*
%wide.load13 = load <vscale x 2 x double>, <vscale x 2 x double>* %18, align 8
%19 = mul i32 %10, 6
%20 = sext i32 %19 to i64
%21 = getelementptr inbounds double, double* %8, i64 %20
%22 = bitcast double* %21 to <vscale x 2 x double>*
%wide.load14 = load <vscale x 2 x double>, <vscale x 2 x double>* %22, align 8
%23 = fadd fast <vscale x 2 x double> %vec.phi, %wide.load
%24 = fadd fast <vscale x 2 x double> %vec.phi9, %wide.load12
%25 = fadd fast <vscale x 2 x double> %vec.phi10, %wide.load13
%26 = fadd fast <vscale x 2 x double> %vec.phi11, %wide.load14
%index.next = add nuw i64 %index, %7
%27 = icmp eq i64 %index.next, %n.vec
br i1 %27, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
%bin.rdx = fadd fast <vscale x 2 x double> %24, %23
%bin.rdx15 = fadd fast <vscale x 2 x double> %25, %bin.rdx
%bin.rdx16 = fadd fast <vscale x 2 x double> %26, %bin.rdx15
%28 = call fast double @llvm.vector.reduce.fadd.nxv2f64(double -0.000000e+00, <vscale x 2 x double> %bin.rdx16)
%cmp.n = icmp eq i64 %n.mod.vf, 0
br i1 %cmp.n, label %L17, label %scalar.ph
scalar.ph: ; preds = %middle.block, %L10.lr.ph
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L10.lr.ph ]
%bc.merge.rdx = phi double [ %28, %middle.block ], [ 0.000000e+00, %L10.lr.ph ]
br label %L10
L10: ; preds = %L10, %scalar.ph
%value_phi18 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %32, %L10 ]
%value_phi7 = phi double [ %bc.merge.rdx, %scalar.ph ], [ %31, %L10 ]
%29 = getelementptr inbounds double, double* %5, i64 %value_phi18
%30 = load double, double* %29, align 8
%31 = fadd fast double %value_phi7, %30
%32 = add nuw nsw i64 %value_phi18, 1
%exitcond.not = icmp eq i64 %32, %3
br i1 %exitcond.not, label %L17, label %L10
L17: ; preds = %L10, %middle.block, %top
%value_phi2 = phi double [ 0.000000e+00, %top ], [ %28, %middle.block ], [ %31, %L10 ]
ret double %value_phi2
}
julia> @code_native debuginfo=:none sumsimd(randn(Float64, 1_000_000))
.text
.file "sumsimd"
.globl julia_sumsimd_831 // -- Begin function julia_sumsimd_831
.p2align 3
.type julia_sumsimd_831,@function
julia_sumsimd_831: // @julia_sumsimd_831
.cfi_startproc
// %bb.0: // %top
ldr x8, [x0, #8]
cbz x8, .LBB0_3
// %bb.1: // %L10.lr.ph
ldr x9, [x0]
cnth x11
cmp x8, x11
b.hs .LBB0_4
// %bb.2:
mov x10, xzr
movi d0, #0000000000000000
b .LBB0_7
.LBB0_3:
movi d0, #0000000000000000
// kill: def $d0 killed $d0 killed $z0
ret
.LBB0_4: // %vector.ph
udiv x10, x8, x11
movi d1, #0000000000000000
mov z0.d, #0 // =0x0
ptrue p0.d, vl1
cntd x13
cntw x15
cntd x16, all, mul #3
mov x12, xzr
add x14, x9, w13, sxtw #3
add x15, x9, w15, sxtw #3
add x16, x9, w16, sxtw #3
sel z1.d, p0, z1.d, z0.d
mov z2.d, z0.d
mov z3.d, z0.d
ptrue p0.d
mul x10, x10, x11
sub x13, x8, x10
.p2align 2
.LBB0_5: // %vector.body
// =>This Inner Loop Header: Depth=1
ld1d { z4.d }, p0/z, [x9, x12, lsl #3]
ld1d { z5.d }, p0/z, [x14, x12, lsl #3]
fadd z1.d, z1.d, z4.d
ld1d { z6.d }, p0/z, [x15, x12, lsl #3]
ld1d { z7.d }, p0/z, [x16, x12, lsl #3]
fadd z0.d, z0.d, z5.d
fadd z2.d, z2.d, z6.d
fadd z3.d, z3.d, z7.d
add x12, x12, x11
cmp x12, x10
b.ne .LBB0_5
// %bb.6: // %middle.block
fadd z0.d, z0.d, z1.d
fadd z0.d, z2.d, z0.d
fadd z0.d, z3.d, z0.d
faddv d0, p0, z0.d
cbz x13, .LBB0_9
.LBB0_7: // %L10.preheader
sub x8, x8, x10
add x9, x9, x10, lsl #3
.p2align 2
.LBB0_8: // %L10
// =>This Inner Loop Header: Depth=1
ldr d1, [x9], #8
fadd d0, d0, d1
subs x8, x8, #1
b.ne .LBB0_8
.LBB0_9: // %L17
// kill: def $d0 killed $d0 killed $z0
ret
.Lfunc_end0:
.size julia_sumsimd_831, .Lfunc_end0-julia_sumsimd_831
.cfi_endproc
// -- End function
.section ".note.GNU-stack","",@progbits
julia> @code_llvm debuginfo=:none sumsimd(randn(Float16, 1_000_000))
define half @julia_sumsimd_840({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
%.not = icmp eq i64 %3, 0
br i1 %.not, label %L17, label %L10.lr.ph
L10.lr.ph: ; preds = %top
%4 = bitcast {}* %0 to half**
%5 = load half*, half** %4, align 8
%6 = call i64 @llvm.vscale.i64()
%7 = shl i64 %6, 5
%min.iters.check = icmp ult i64 %3, %7
br i1 %min.iters.check, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %L10.lr.ph
%n.mod.vf = urem i64 %3, %7
%n.vec = sub nsw i64 %3, %n.mod.vf
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <vscale x 8 x half> [ insertelement (<vscale x 8 x half> zeroinitializer, half 0xH0000, i32 0), %vector.ph ], [ %23, %vector.body ]
%vec.phi9 = phi <vscale x 8 x half> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
%vec.phi10 = phi <vscale x 8 x half> [ zeroinitializer, %vector.ph ], [ %25, %vector.body ]
%vec.phi11 = phi <vscale x 8 x half> [ zeroinitializer, %vector.ph ], [ %26, %vector.body ]
%8 = getelementptr inbounds half, half* %5, i64 %index
%9 = bitcast half* %8 to <vscale x 8 x half>*
%wide.load = load <vscale x 8 x half>, <vscale x 8 x half>* %9, align 2
%10 = call i32 @llvm.vscale.i32()
%11 = shl i32 %10, 3
%12 = sext i32 %11 to i64
%13 = getelementptr inbounds half, half* %8, i64 %12
%14 = bitcast half* %13 to <vscale x 8 x half>*
%wide.load12 = load <vscale x 8 x half>, <vscale x 8 x half>* %14, align 2
%15 = shl i32 %10, 4
%16 = sext i32 %15 to i64
%17 = getelementptr inbounds half, half* %8, i64 %16
%18 = bitcast half* %17 to <vscale x 8 x half>*
%wide.load13 = load <vscale x 8 x half>, <vscale x 8 x half>* %18, align 2
%19 = mul i32 %10, 24
%20 = sext i32 %19 to i64
%21 = getelementptr inbounds half, half* %8, i64 %20
%22 = bitcast half* %21 to <vscale x 8 x half>*
%wide.load14 = load <vscale x 8 x half>, <vscale x 8 x half>* %22, align 2
%23 = fadd fast <vscale x 8 x half> %vec.phi, %wide.load
%24 = fadd fast <vscale x 8 x half> %vec.phi9, %wide.load12
%25 = fadd fast <vscale x 8 x half> %vec.phi10, %wide.load13
%26 = fadd fast <vscale x 8 x half> %vec.phi11, %wide.load14
%index.next = add nuw i64 %index, %7
%27 = icmp eq i64 %index.next, %n.vec
br i1 %27, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
%bin.rdx = fadd fast <vscale x 8 x half> %24, %23
%bin.rdx15 = fadd fast <vscale x 8 x half> %25, %bin.rdx
%bin.rdx16 = fadd fast <vscale x 8 x half> %26, %bin.rdx15
%28 = call fast half @llvm.vector.reduce.fadd.nxv8f16(half 0xH8000, <vscale x 8 x half> %bin.rdx16)
%cmp.n = icmp eq i64 %n.mod.vf, 0
br i1 %cmp.n, label %L17, label %scalar.ph
scalar.ph: ; preds = %middle.block, %L10.lr.ph
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L10.lr.ph ]
%bc.merge.rdx = phi half [ %28, %middle.block ], [ 0xH0000, %L10.lr.ph ]
br label %L10
L10: ; preds = %L10, %scalar.ph
%value_phi18 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %32, %L10 ]
%value_phi7 = phi half [ %bc.merge.rdx, %scalar.ph ], [ %31, %L10 ]
%29 = getelementptr inbounds half, half* %5, i64 %value_phi18
%30 = load half, half* %29, align 2
%31 = fadd fast half %value_phi7, %30
%32 = add nuw nsw i64 %value_phi18, 1
%exitcond.not = icmp eq i64 %32, %3
br i1 %exitcond.not, label %L17, label %L10
L17: ; preds = %L10, %middle.block, %top
%value_phi2 = phi half [ 0xH0000, %top ], [ %28, %middle.block ], [ %31, %L10 ]
ret half %value_phi2
}
julia> @code_native debuginfo=:none sumsimd(randn(Float16, 1_000_000))
.text
.file "sumsimd"
.globl julia_sumsimd_842 // -- Begin function julia_sumsimd_842
.p2align 3
.type julia_sumsimd_842,@function
julia_sumsimd_842: // @julia_sumsimd_842
.cfi_startproc
// %bb.0: // %top
ldr x8, [x0, #8]
cbz x8, .LBB0_3
// %bb.1: // %L10.lr.ph
ldr x9, [x0]
rdvl x11, #2
cmp x8, x11
b.hs .LBB0_4
// %bb.2:
mov x10, xzr
movi d0, #0000000000000000
b .LBB0_7
.LBB0_3:
movi d0, #0000000000000000
// kill: def $h0 killed $h0 killed $z0
ret
.LBB0_4: // %vector.ph
udiv x10, x8, x11
movi d1, #0000000000000000
mov z0.h, #0 // =0x0
ptrue p0.h, vl1
cnth x13
rdvl x15, #1
cnth x16, all, mul #3
mov x12, xzr
add x14, x9, w13, sxtw #1
add x15, x9, w15, sxtw #1
add x16, x9, w16, sxtw #1
sel z1.h, p0, z1.h, z0.h
mov z2.d, z0.d
mov z3.d, z0.d
ptrue p0.h
mul x10, x10, x11
sub x13, x8, x10
.p2align 2
.LBB0_5: // %vector.body
// =>This Inner Loop Header: Depth=1
ld1h { z4.h }, p0/z, [x9, x12, lsl #1]
ld1h { z5.h }, p0/z, [x14, x12, lsl #1]
fadd z1.h, z1.h, z4.h
ld1h { z6.h }, p0/z, [x15, x12, lsl #1]
ld1h { z7.h }, p0/z, [x16, x12, lsl #1]
fadd z0.h, z0.h, z5.h
fadd z2.h, z2.h, z6.h
fadd z3.h, z3.h, z7.h
add x12, x12, x11
cmp x12, x10
b.ne .LBB0_5
// %bb.6: // %middle.block
fadd z0.h, z0.h, z1.h
fadd z0.h, z2.h, z0.h
fadd z0.h, z3.h, z0.h
faddv h0, p0, z0.h
cbz x13, .LBB0_9
.LBB0_7: // %L10.preheader
sub x8, x8, x10
add x9, x9, x10, lsl #1
.p2align 2
.LBB0_8: // %L10
// =>This Inner Loop Header: Depth=1
ldr h1, [x9], #2
fadd h0, h0, h1
subs x8, x8, #1
b.ne .LBB0_8
.LBB0_9: // %L17
// kill: def $h0 killed $h0 killed $z0
ret
.Lfunc_end0:
.size julia_sumsimd_842, .Lfunc_end0-julia_sumsimd_842
.cfi_endproc
// -- End function
.section ".note.GNU-stack","",@progbits
Performance is exactly the same with and without JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512". This is a huge improvement in terms of usability (until I run into new exciting crashes).
vscale registers, fancy.
Nice to see that LLVM 14 is good for something!
Now we just need to emit Float16 instructions outside of fastmath.
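A quick way to check the current behavior would be a plain kernel without @simd (and hence without fast-math flags) to compare against the version above (a hypothetical follow-up; no results claimed):

function sumplain(x)  # same reduction as sumsimd, minus @simd / fast-math flags
    s = zero(eltype(x))
    for xi in x
        s += xi
    end
    s
end
# inspect with: @code_native debuginfo=:none sumplain(randn(Float16, 1_000_000))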
It'd be great to get good support for SVE, especially as SVE2 will become standard for ARMv9.
However, early tests using LLVM vector intrinsics on the A64FX did not go well. Here is a minimal example on Godbolt, showing a vectorized (but not unrolled) dot product on the A64FX, which has 512-bit vectors. The problem is that <8 x double> gets translated into 4x <2 x double> NEON instructions, instead of an SVE instruction. v registers are NEON, and we can see that the single @llvm.fma.v8f64 was broken up into 4 separate fmla instructions. Based on this document, SVE registers would be denoted by z[0-31].
This makes me wonder: to actually get intrinsic support for SVE, would we need to use <vscale x 2 x double>, etc., instead? This isn't compelling in Julia (unlike C/C++/wherever folks distribute binaries), since we're probably compiling for the specific target machine anyway, and can easily find the appropriate vector length using @llvm.vscale.i64.
Furthermore, we don't have any way to represent that at the moment. NTuple{L,Core.VecElement{T}} maps to <L x T>, but there's no vscale version.
Anyone have any insight into/knowledge about this?