Closed baggepinnen closed 4 years ago
A few comments: `f` must be valid on `SVec`. This comment also applies to `@vectorize` loops, so it is more of a general limitation of the library. However, it is also necessary for `f` to be marked `@inline` until this issue gets resolved, which makes it more severe in cases such as your example. A loop may encourage you to essentially write out the body of the function (many little pieces that may be more reasonable to inline), while `f` itself could be a larger function.

Demo of the issue:
```julia
julia> using SIMDPirates, BenchmarkTools

julia> x = ntuple(Val(8)) do i Core.VecElement(randn()) end
(VecElement{Float64}(-1.0759056141831105), VecElement{Float64}(-1.1579962137902238), VecElement{Float64}(2.560294914431641), VecElement{Float64}(-1.1957407117264527), VecElement{Float64}(-2.132117101923461), VecElement{Float64}(-0.5877224346584126), VecElement{Float64}(0.8586640177057222), VecElement{Float64}(-0.7302450769652871))

julia> sx = SVec(x)
SVec{8,Float64}<-1.0759056141831105, -1.1579962137902238, 2.560294914431641, -1.1957407117264527, -2.132117101923461, -0.5877224346584126, 0.8586640177057222, -0.7302450769652871>

julia> exp(sx)
SVec{8,Float64}<0.3409888111263572, 0.3141149699378415, 12.939632837353068, 0.3024798208329842, 0.11858596932271068, 0.5555912401702848, 2.3600056608637496, 0.4817908997685664>

julia> @btime sum(exp($sx))
  7.194 ns (0 allocations: 0 bytes)
17.413190209375564

julia> @btime exp($sx)
julia: /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:514: unsigned int convert_struct_offset(llvm::Type*, unsigned int): Assertion `SL->getElementOffset(idx) == byte_offset' failed.

signal (6): Aborted
in expression starting at REPL[6]:1
raise at /usr/src/debug/glibc-2.30-298.x86_64/signal/../sysdeps/unix/sysv/linux/internal-signals.h:84
abort at /usr/src/debug/glibc-2.30-298.x86_64/stdlib/abort.c:79
__assert_fail_base at /usr/src/debug/glibc-2.30-298.x86_64/assert/assert.c:92
__assert_fail at /usr/src/debug/glibc-2.30-298.x86_64/assert/assert.c:101
convert_struct_offset at /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:514 [inlined]
convert_struct_offset at /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:509
convert_struct_offset at /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:520 [inlined]
emit_new_struct at /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:2619
emit_builtin_call at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:2596
emit_call at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3331
emit_expr at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4103
emit_ssaval_assign at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3807
emit_stmtpos at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4000 [inlined]
emit_function at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:6578
jl_compile_linfo at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:1230
emit_invoke at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3277
emit_expr at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4094
emit_ssaval_assign at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3807
emit_stmtpos at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4000 [inlined]
emit_function at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:6578
jl_compile_linfo at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:1230
emit_invoke at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3277
emit_expr at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4094
emit_ssaval_assign at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:3807
emit_stmtpos at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:4000 [inlined]
emit_function at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:6578
jl_compile_linfo at /home/chriselrod/Documents/languages/julia/src/codegen.cpp:1230
jl_compile_method_internal at /home/chriselrod/Documents/languages/julia/src/gf.c:1889
_jl_invoke at /home/chriselrod/Documents/languages/julia/src/gf.c:2153 [inlined]
jl_apply_generic at /home/chriselrod/Documents/languages/julia/src/gf.c:2322
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1654 [inlined]
do_apply at /home/chriselrod/Documents/languages/julia/src/builtins.c:634
jl_f__apply at /home/chriselrod/Documents/languages/julia/src/builtins.c:648 [inlined]
jl_f__apply_latest at /home/chriselrod/Documents/languages/julia/src/builtins.c:684
#invokelatest#1 at ./essentials.jl:715 [inlined]
invokelatest##kw at ./essentials.jl:710 [inlined]
#run_result#37 at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:32 [inlined]
run_result##kw at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:32 [inlined]
#run#39 at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:46
run##kw at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:46 [inlined]
run##kw at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:46 [inlined]
#warmup#42 at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:79 [inlined]
warmup at /home/chriselrod/.julia/packages/BenchmarkTools/7aqwe/src/execution.jl:79
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1654 [inlined]
do_call at /home/chriselrod/Documents/languages/julia/src/interpreter.c:328
eval_value at /home/chriselrod/Documents/languages/julia/src/interpreter.c:417
eval_stmt_value at /home/chriselrod/Documents/languages/julia/src/interpreter.c:368 [inlined]
eval_body at /home/chriselrod/Documents/languages/julia/src/interpreter.c:760
jl_interpret_toplevel_thunk_callback at /home/chriselrod/Documents/languages/julia/src/interpreter.c:888
Lenter_interpreter_frame_start_val at /home/chriselrod/Documents/languages/julia/usr/bin/../lib/libjulia.so.1 (unknown line)
jl_interpret_toplevel_thunk at /home/chriselrod/Documents/languages/julia/src/interpreter.c:897
jl_toplevel_eval_flex at /home/chriselrod/Documents/languages/julia/src/toplevel.c:814
jl_toplevel_eval_flex at /home/chriselrod/Documents/languages/julia/src/toplevel.c:764
jl_toplevel_eval at /home/chriselrod/Documents/languages/julia/src/toplevel.c:823 [inlined]
jl_toplevel_eval_in at /home/chriselrod/Documents/languages/julia/src/toplevel.c:843
eval at ./boot.jl:331
eval_user_input at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
run_backend at /home/chriselrod/.julia/packages/Revise/0KQ7U/src/Revise.jl:1033
#85 at ./task.jl:349
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1654 [inlined]
start_task at /home/chriselrod/Documents/languages/julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 32354569 (Pool: 32346408; Big: 8161); GC: 42
fish: “/home/chriselrod/Documents/lang…” terminated by signal SIGABRT (Abort)
```
Returning an `NTuple{N,Core.VecElement{Float64}}`, or one wrapped in a struct like the `SVec` type, is likely to crash Julia.
To implement `map`, I'd just write a `vmap!(f, dest, collection...)` function using an `@vectorize for ...` loop. The hardest part would be the `vmap(f, collection...)` function allocating a `dest` object with the correct `eltype`.
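As a rough illustration of that pair (a hypothetical sketch, not the library's implementation: it uses `@inbounds @simd` where the real version would use `@vectorize`, and `Base.promote_op` for the `eltype` inference):

```julia
# Hypothetical sketch of the vmap!/vmap pair described above.
function vmap!(f, dest, colls...)
    @inbounds @simd for i in eachindex(dest, colls...)
        # getindex.(colls, i) gathers the i-th element of each collection
        dest[i] = f(getindex.(colls, i)...)
    end
    return dest
end

function vmap(f, colls...)
    # The "hardest part": inferring the output eltype from the inputs.
    T = Base.promote_op(f, map(eltype, colls)...)
    dest = similar(first(colls), T)
    return vmap!(f, dest, colls...)
end
```

`Base.promote_op` relies on inference, so a hand-rolled fallback (e.g. computing the first element eagerly) might be needed for robustness.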
Broadcasting is harder, because in general it will want multiple loops. We won't always want to fuse them into a single loop, because if we have
```julia
x = rand(4);
y = rand(3);
A = rand(4,3);
B = @vectorize log.(x) .* sin.(A) .+ exp.(y')
```
we would want to be able to create our own version of a `Base.materialize!` function that generates code like:
```julia
R, C = size(A)
Rrep, Rrem = divrem(R, W) # W is the vector width
@inbounds for r in 0:Rrep-1
    # log of a block of x is computed once per block of rows
    vlogx = log(vload(Vec{W,eltype(x)}, pointer(x) + r*W*sizeof(eltype(x))))
    for c in 0:C-1
        vexpy = vbroadcast(Vec{W,eltype(y)}, exp(y[c+1]))
        vsinA = sin(vload(Vec{W,eltype(A)}, pointer(A) + (r*W + c*stride(A,2))*sizeof(eltype(A))))
        vB = vmuladd(vlogx, vsinA, vexpy)
        vstore!(pointer(B) + (r*W + c*stride(B,2))*sizeof(eltype(B)), vB)
    end
end
```
(plus similar code to handle the remaining `Rrem` rows).
I have been working on code to try to do this on the graphs branch.
So far I have been trying to calculate the "cost" of different methods and orders of evaluating a set of loops. From there, I was planning to use the naive brute-force approach of iterating over many different approaches and picking the least costly for that set of loops.
Specifically, if we have `N` nested loops, the plan is to iterate over all `N!` possible loop orders, evaluate the cost of unrolling the first of the nested loops, and, if `N > 1`, also of tiling the first two nested loops. Then it will pick the minimum-cost approach.
The `N!` of course means this will only be feasible for small `N`. Maybe someone can come up with a smarter search algorithm.
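The brute-force part of that search can be sketched as follows; `cost` here is a user-supplied stand-in for the real cost model (which would also score unrolling and tiling):

```julia
# Toy sketch of the N! loop-order search described above.
function all_orders(loops)
    length(loops) <= 1 && return [loops]
    orders = Vector{typeof(loops)}()
    for (i, l) in enumerate(loops)
        rest = vcat(loops[1:i-1], loops[i+1:end])
        for tail in all_orders(rest)
            push!(orders, vcat([l], tail))
        end
    end
    return orders
end

# Enumerate every permutation and keep the cheapest one.
function best_order(loops, cost)
    orders = all_orders(loops)
    return orders[argmin(map(cost, orders))]
end
```

For example, a cost function that penalizes any order whose innermost loop is not the unit-stride one would always select an order ending in that loop.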
If the input arrays are all "dense", then it can consider fusing loops. That is, if we're calculating `A .+ B` and `A` and `B` are matrices of the same size, we would want to use a single loop, rather than a loop over rows nested inside another over columns.
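The fused case amounts to a single loop over linear indices, e.g. (a minimal sketch, assuming dense, same-size inputs):

```julia
# When A and B are dense and the same size, A .+ B can be computed with
# one flat loop over linear indices instead of nested row/column loops.
function fused_add!(dest, A, B)
    @inbounds @simd for i in eachindex(dest, A, B)
        dest[i] = A[i] + B[i]
    end
    return dest
end
```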
The problem with this is that it is only easily solvable for fixed-size arrays, where we have access to the sizes at compile time.
Otherwise, what if `A` is `R` by `C`, and `B` is `R` by `1`, `1` by `C`, or `1` by `1`? Broadcasting is still perfectly valid in those cases.
My proposed solution would be to make an `@vectorize` broadcast slightly more restrictive than a regular broadcast, by refusing to broadcast over length-1 axes that cannot be identified at compile time. This means neither `a = rand(1,4)` nor `a = rand(4,1)` would be allowed, but `a = rand(4)` and `a = rand(4)'` would both be fine.
I could also define an `AbstractArray` type of arbitrary dimensions with an arbitrary number of length-1 axes that can be used like the adjoints, making it equally expressive as regular broadcasting, just a little more verbose / requiring that array wrapper type.
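The distinction rests on what the type itself guarantees: a `Vector`'s adjoint encodes its length-1 first axis in the type, while a `1×N` `Matrix` does not. A hypothetical check (the name `known_unit_axes` is made up for illustration) might look like:

```julia
using LinearAlgebra: Adjoint

# Which axes are known to have length 1 from the type alone?
known_unit_axes(::Type{<:Vector}) = (2,)   # trailing (broadcast) dim is length 1
known_unit_axes(::Type{<:Adjoint{<:Any,<:Vector}}) = (1,)
known_unit_axes(::Type{<:Matrix}) = ()     # nothing known at compile time
```

A `Matrix` argument would then be rejected whenever one of its runtime axes has length 1, since the macro could not have planned for it.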
To do a reasonable job of modeling computational cost, the functions being evaluated cannot be opaque.
For now, it just uses the function's symbol to try to guess the cost; if the function is unrecognized, it will assume a moderately costly "opaque instruction" that prevents tiling or any unrolling beyond your computer's vector width.
It should still be able to pick a good loop order, because the `getindex` and `setindex!` calls will still be visible. That means `C' .= f.(A', B')` would still pick the correct order, essentially untransposing the inputs.
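A symbol-keyed cost lookup of the kind described might be as simple as the following (the table entries are illustrative guesses, not the library's actual numbers):

```julia
# Hypothetical symbol-based cost table: known functions get an estimated
# cost; unknown ones get a pessimistic "opaque instruction" cost that
# would inhibit tiling and extra unrolling.
const INSTRUCTION_COSTS = Dict{Symbol,Float64}(
    :+ => 0.5, :* => 0.5, :muladd => 0.5,
    :log => 20.0, :exp => 20.0, :sin => 25.0,
)
const OPAQUE_COST = 50.0

instruction_cost(f::Symbol) = get(INSTRUCTION_COSTS, f, OPAQUE_COST)
```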
`f.(A', B')` by itself is more interesting, however. Would we want both to give it the freedom to return a transposed matrix, and to make it smart enough to do so? That is, have it figure out that `dest = similar(promote_size(src1', src2'))'` would be faster than `dest = similar(promote_size(src1, src2))`?
EDIT:
Eventually, I would also like the macro to consider packing.
Packing would also allow caching the `exp(y[c])` results from `@vectorize log.(x) .* sin.(A) .+ exp.(y')`, so that we aren't reevaluating `exp(y[c])` a total of `Rrep + (Rrem > 0)` times.
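A minimal form of that caching is hoisting the `exp` (and `log`) evaluations out of the row loop entirely; a scalar sketch, without the SIMD machinery:

```julia
# Sketch of the caching described above: exp.(y) and log.(x) are each
# evaluated once up front, instead of once per block of rows/columns.
function cached_broadcast!(B, x, A, y)
    expy = exp.(y)   # cached: reused for every block of rows
    logx = log.(x)
    @inbounds for c in axes(A, 2), r in axes(A, 1)
        B[r, c] = muladd(logx[r], sin(A[r, c]), expy[c])
    end
    return B
end
```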
By considering packing, it should be able to transform the following

```julia
@vectorize for m in 1:M, n in 1:N, k in 1:K
    C[m,n] += A[m,k] * B[k,n]
end
```

into an optimized `gemm!` -- only more flexible, because you can combine it with other ops.
Ideally, it should also be able to consider splitting loops into multiple loops (rather than just letting you fuse them), but all of this will take quite some time.
It does now support vectorized broadcasts. I'll try to add `vmap` and `vmap!` functions in the next release.
Awesome developments! I'll be looking forward to experimenting with this in MonteCarloMeasurements :D
This is just me dreaming about the future, but wouldn't it be neat if one could write stuff like
? :smiley: