Hm, it's getting stack overflows again on this test, even with the safe mode:
Test tree construction and scoring: Error During Test at /home/runner/.julia/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
Got exception outside of a @test
LoadError: StackOverflowError:
Stacktrace:
[1] vabs(v::VectorizationBase.Vec{8, Float16})
@ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29
[2] abs(a::VectorizationBase.Vec{8, Float16})
@ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/base_defs.jl:46
--- the last 2 lines are repeated 39990 more times ---
[79983] vabs(v::VectorizationBase.Vec{8, Float16})
@ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29
Maybe the safe operators (https://github.com/SymbolicML/DynamicExpressions.jl/blob/093b6b10acbd35c4c0142aae0bee3a63fa5b6c95/test/test_params.jl#L4) break some of the assumptions? Or maybe it's an issue with `Float16` operations?
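For context, a guarded scalar operator of the kind defined there looks roughly like this (a hypothetical sketch; the real definitions are behind the link above):

# Hypothetical sketch of a guarded ("safe") scalar operator; the actual
# definitions live in test/test_params.jl, linked above.
safe_log(x::T) where {T<:AbstractFloat} = x > zero(T) ? log(x) : T(NaN)

The branch on `x` is exactly the kind of scalar control flow that SIMD evaluation has to handle specially.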
`safe_log(x::AbstractSIMD) = log(x)`; these are already safe.
`Float16` isn't supported well/tested much. `can_turbo` checks assuming `Int`, rather than the element type. Feel free to either define `LoopVectorization.check_type(::Type{Float16}) = false`, or make PRs to VectorizationBase/wherever is appropriate to fix things like `abs` that are broken.
Probably wouldn't require too much work to get a lot of this working:
https://github.com/JuliaSIMD/VectorizationBase.jl/blob/d39b4a04bc00b5485d7f95d490ce5433cf921cde/src/llvm_intrin/intrin_funcs.jl#L138-L139
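For reference, the quick opt-out mentioned above is just a one-line method definition (taken verbatim from the suggestion; it disables vectorization for `Float16` rather than fixing it):

using LoopVectorization

# As suggested above: tell LoopVectorization that Float16 is not a
# supported element type, so its checks reject it and callers can
# fall back to a plain loop.
LoopVectorization.check_type(::Type{Float16}) = false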
Thanks! I can certainly also make it so that only `Float32` and `Float64` are attempted to be `@turbo`ed; all other types use `@inbounds @simd`.
How hard do you think it would be for me to build a macro in this package that does a test vectorization on the given type and operator and, if it fails for whatever reason, falls back to `@inbounds @simd`? I'm hoping to make this as robust as possible for all user-defined operators.
Maybe a simpler but less general solution would just be to have a pre-defined set of operators that are known to work, and only `@turbo` on those.
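For illustration, both of those gates could be small helper checks along these lines (a rough sketch; the names are made up):

# Rough sketch; both names here are made up for illustration.
# Gate by element type: only Float32/Float64 attempt @turbo.
is_turbo_safe_type(::Type{T}) where {T} = T === Float32 || T === Float64

# Gate by a pre-defined allow-list of operators known to work:
const TURBO_SAFE_OPERATORS = (+, -, *, /, exp, sin, cos)
is_turbo_safe_op(op::F) where {F<:Function} = any(op === g for g in TURBO_SAFE_OPERATORS)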
Here's an example. Could something like this be done in a macro instead of a generated function? Maybe even as a new `@turbo_with_fallback` macro in LoopVectorization.jl?
using DynamicExpressions
using LoopVectorization
@generated function unary_operator_kernel!(
    x::AbstractArray{T}, y::AbstractArray{T}, ::Val{op_idx}, operators::OperatorEnum
) where {T,op_idx}
    # Inside the @generated body, `x` and `operators` are bound to the
    # argument *types*, so we can probe vectorizability at compile time.
    # First, try to @turbo an example array:
    num_samples = 32
    _x = similar(x, num_samples)
    _y = similar(x, num_samples)
    # Get the operator instance from the OperatorEnum type parameters:
    unaops = operators.parameters[2]
    op = unaops.parameters[op_idx].instance
    can_turbo = try
        @turbo for i in indices(_x)
            _y[i] = op(_x[i])
        end
        true
    catch  # Catch ALL errors, including StackOverflowError.
        false
    end
    if can_turbo
        quote
            @turbo for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    else
        quote
            @inbounds @simd for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    end
end
x = randn(Float16, 100);
y = similar(x);
operators = OperatorEnum(;unary_operators=[abs])
unary_operator_kernel!(x, y, Val(1), operators)
Even though `abs` currently breaks for `Float16` SIMD vectors, this will still work: the test `@turbo` hits the stack overflow, which is caught, so `@inbounds @simd` is used in the generated function instead.
> Thanks! I can certainly also make it so that only `Float32` and `Float64` are attempted to be `@turbo`ed; all other types use `@inbounds @simd`.
My suggestion was to PR LoopVectorization to disable it for `Float16` in general, or (better) fix it so it works.
Another option is to actually forward and promote type information to the `can_turbo` calls.
Thanks, will consider! Would that be the one remaining issue, or do you foresee others?
I am wondering whether I should simply add an opt-in `use_turbo` argument to the evaluation function, so that users are not surprised by any errors that come out of evaluation.
Weird, I'm also seeing this for the derivative ops with `Float32` variables. Just to be robust, maybe it's best I leave it optional for now, with a `turbo::Bool` kwarg to the evaluation functions.
Testing derivatives with respect to variables, with type=Float32.
ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context
Stacktrace:
[1] _pow_grad_x(x::VectorizationBase.Vec{4, Float32}, p::VectorizationBase.Vec{4, Float32}, y::VectorizationBase.Vec{4, Float32})
@ ChainRules ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:323
[2] power_pullback
@ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:201 [inlined]
...
[20] #eval_diff_tree_array#33
@ ~/Documents/DynamicExpressions.jl/src/EvaluateEquationDerivative.jl:49 [inlined]
[21] #307
@ ./none:0 [inlined]
[22] iterate
@ ./generator.jl:47 [inlined]
[23] collect(itr::Base.Generator{UnitRange{Int64}, var"#307#313"{Matrix{Float32}, OperatorEnum{Tuple{typeof(+), typeof(*), typeof(-), typeof(/), typeof(pow_abs2)}, Tuple{typeof(custom_cos), typeof(exp), typeof(sin)}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(+)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(*)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(-)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(/)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(pow_abs2)}}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(custom_cos)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(exp)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(sin)}}}, Bool, Node{Float32}}})
@ Base ./array.jl:787
[24] top-level scope
@ ~/Documents/DynamicExpressions.jl/test/test_derivatives.jl:85
[25] include(fname::String)
@ Base.MainInclude ./client.jl:476
[26] top-level scope
@ REPL[21]:1
in expression starting at /Users/mcranmer/Documents/DynamicExpressions.jl/test/test_derivatives.jl:45
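For concreteness, a kernel gated by such a `turbo::Bool` kwarg could look like this (a hypothetical sketch; `apply_unary!` and `_apply_unary!` are illustrative names, not the actual DynamicExpressions API):

using LoopVectorization

# Hypothetical sketch: `turbo` is strictly opt-in, defaulting to the
# plain @inbounds @simd loop. Dispatching on Val(turbo) keeps the two
# kernels in separate methods, so the @turbo method is only ever
# compiled for operators that are actually called with turbo=true.
apply_unary!(y, x, op; turbo::Bool=false) = _apply_unary!(y, x, op, Val(turbo))

function _apply_unary!(y::AbstractVector{T}, x::AbstractVector{T}, op::F, ::Val{true}) where {T,F}
    @turbo for i in indices(x)
        y[i] = op(x[i])
    end
    return y
end

function _apply_unary!(y::AbstractVector{T}, x::AbstractVector{T}, op::F, ::Val{false}) where {T,F}
    @inbounds @simd for i in eachindex(x)
        y[i] = op(x[i])
    end
    return y
end

With `turbo=false` as the default, existing users see no behavior change, and any `@turbo` breakage only surfaces for those who opt in.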
Okay, with turbo set to optional, I can get the tests to pass. I'm seeing a really nice speed boost!!
using DynamicExpressions
using BenchmarkTools  # for @btime

X = randn(Float32, 3, 5_000);

# Feature nodes:
x1, x2 = Node(;feature=1), Node(;feature=2)

# Dynamically construct the expression:
tree = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)

# Evaluate:
@btime tree(X)
# 128 us

@btime tree(X; turbo=true)
# 57 us
Wow that is fast!
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/Utils.jl | 3 | 10 | 30.0% |
| Total: | 61 | 68 | 89.71% |

| Totals | |
|---|---|
| Change from base Build 3308712281: | 0.6% |
| Covered Lines: | 894 |
| Relevant Lines: | 967 |
Still not as nice as a handwritten kernel, though. The more complex the expression, the faster a single fused kernel will be compared to the recursive evaluation, since the recursive evaluator makes a separate pass over the data for each node:
using LoopVectorization  # for @turbo and indices

function f(X)
    y = Array{Float32}(undef, size(X, 2))
    @turbo for i in indices(y)
        x1 = X[1, i]
        x2 = X[2, i]
        y[i] = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)
    end
    y
end
@btime f(X)
# 5.2 us
With https://github.com/JuliaSIMD/LoopVectorization.jl/pull/431, this should potentially work for arbitrary user-defined operators, since it will fall back to `@inbounds @simd` when an operator cannot be SIMD-ified. It makes evaluation even faster for me: now about 30% faster than a simple handwritten function, and even a little faster than a manually-written SIMD loop (which does not use `@turbo`). Let's see if this works.
cc @chriselrod