SymbolicML / DynamicExpressions.jl

Ridiculously fast symbolic expressions
https://symbolicml.org/DynamicExpressions.jl/dev
Apache License 2.0

Use `LoopVectorization.@turbo` in dynamic expression evaluation scheme #9

Closed · MilesCranmer closed 1 year ago

MilesCranmer commented 1 year ago

With https://github.com/JuliaSIMD/LoopVectorization.jl/pull/431, this should work even for arbitrary user-defined operators, since it falls back to `@inbounds @simd` when an operator cannot be SIMD-ified. It makes evaluation even faster for me: evaluation is now about 30% faster than a simple handwritten function, and even a little faster than a manually written SIMD loop (one that does not use `@turbo`).

Let's see if this works.

cc @chriselrod

MilesCranmer commented 1 year ago

Hm, it's getting stack overflows again on this test, even with the safe mode:

Test tree construction and scoring: Error During Test at /home/runner/.julia/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
  Got exception outside of a @test
  LoadError: StackOverflowError:
  Stacktrace:
       [1] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29
       [2] abs(a::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/base_defs.jl:46
  --- the last 2 lines are repeated 39990 more times ---
   [79983] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29

Maybe the safe operators break some of the assumptions? https://github.com/SymbolicML/DynamicExpressions.jl/blob/093b6b10acbd35c4c0142aae0bee3a63fa5b6c95/test/test_params.jl#L4. Or maybe it's an issue with Float16 operations?

chriselrod commented 1 year ago

1. Define `safe_log(x::AbstractSIMD) = log(x)`; these are already safe.
2. Yeah, Float16 isn't supported well or tested much.
3. `can_turbo` checks assuming `Int`, rather than the element type.

Feel free to either define `LoopVectorization.check_type(::Type{Float16}) = false`, or make PRs to VectorizationBase/wherever is appropriate to fix things like `abs` that are broken. It probably wouldn't require too much work to get a lot of this working: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/d39b4a04bc00b5485d7f95d490ce5433cf921cde/src/llvm_intrin/intrin_funcs.jl#L138-L139
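For concreteness, a minimal sketch of the first two suggestions might look like the following (assumptions: `safe_log` is the guarded operator from the test file linked above, and `AbstractSIMD` is imported from VectorizationBase):

using LoopVectorization
using VectorizationBase: AbstractSIMD

# 1. As noted above, SIMD inputs are already safe, so the guarded operator can
#    simply forward to the plain one (in practice this would extend the
#    existing safe_log definition):
safe_log(x::AbstractSIMD) = log(x)

# 2. Opt Float16 out of @turbo entirely until it is better supported:
LoopVectorization.check_type(::Type{Float16}) = false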

MilesCranmer commented 1 year ago

Thanks! I can certainly also make it so that only Float32 and Float64 are attempted to be @turboed; all other types use @inbounds @simd.

How hard do you think it would be for me to build a macro in this package that does a test vectorization on the given type and operator, and, if that fails for whatever reason, falls back to @inbounds @simd? I'm hoping to have this be as robust as possible for all user-defined operators.

Maybe a simpler but less general solution would just be to have a pre-defined set of operators that are known to work, and only @turbo on those.
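For illustration, a hypothetical version of that whitelist approach could look like this (the names `TURBO_SAFE_OPERATORS` and `can_use_turbo` are illustrative, not part of the package); the evaluation kernels would then branch on `can_use_turbo(op, eltype(x))` between the `@turbo` and `@inbounds @simd` loop variants:

using LoopVectorization

# Hypothetical whitelist: only these operators, on Float32/Float64 data,
# take the @turbo path; everything else uses the @inbounds @simd fallback.
const TURBO_SAFE_OPERATORS = (+, -, *, /, abs, cos, sin, exp)

can_use_turbo(op::F, ::Type{T}) where {F,T} =
    T <: Union{Float32,Float64} && any(op === f for f in TURBO_SAFE_OPERATORS)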

MilesCranmer commented 1 year ago

Here's an example. Could something like this be done in a macro instead of a generated function? Maybe even as a new @turbo_with_fallback macro in LoopVectorization.jl?

using DynamicExpressions
using LoopVectorization

@generated function unary_operator_kernel!(
    x::AbstractArray{T}, y::AbstractArray{T}, ::Val{op_idx}, operators::OperatorEnum
) where {T,op_idx}
    # Inside a @generated function, `x` and `operators` are types, not values.
    # First, build a small probe array (`similar` also accepts array types) and
    # try to @turbo-evaluate the operator on it:
    num_samples = 32
    _x = similar(x, num_samples);
    _y = similar(x, num_samples);
    # Extract the operator instance from the OperatorEnum type parameters:
    unaops = operators.parameters[2]
    op = unaops.parameters[op_idx].instance
    can_turbo = try
        @turbo for i in indices(_x)
            _y[i] = op(_x[i])
        end
        true
    catch  # Catch ALL errors.
        false
    end
    if can_turbo
        quote
            @turbo for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    else
        quote
            @inbounds @simd for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    end
end

x = randn(Float16, 100);
y = similar(x);
operators = OperatorEnum(;unary_operators=[abs])
unary_operator_kernel!(x, y, Val(1), operators)

Even though `abs` breaks under `@turbo` for Float16 inputs (the stack overflow above), this still works: the test `@turbo` inside the generator throws, so the `@inbounds @simd` body is generated instead.
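The same probe could also run lazily at evaluation time and be memoized per (operator, element type) pair, instead of living inside a generated function. A rough sketch of that idea (the cache and helper names here are hypothetical, not the package's API):

using LoopVectorization

# Hypothetical cache: remember whether @turbo worked for a given (op, eltype) pair.
const CAN_TURBO_CACHE = Dict{Tuple{Any,DataType},Bool}()

function probe_can_turbo(op::F, ::Type{T}) where {F,T}
    get!(CAN_TURBO_CACHE, (op, T)) do
        probe_in = rand(T, 32)
        probe_out = similar(probe_in)
        try
            @turbo for i in indices(probe_in)
                probe_out[i] = op(probe_in[i])
            end
            true
        catch  # any failure (MethodError, StackOverflowError, ...) disables @turbo for this pair
            false
        end
    end
end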

chriselrod commented 1 year ago

> Thanks! I can certainly also make it so that only Float32 and Float64 are attempted to be @turboed; all other types use @inbounds @simd.

My suggestion was to PR LoopVectorization to disable it for Float16 in general, or (better) fix it so it works.

Another option is to actually forward and promote type information to the can_turbo calls.

MilesCranmer commented 1 year ago

Thanks, will consider! Would that be the one remaining issue or do you foresee others?

I am wondering whether I should simply add an opt-in `use_turbo` argument to the evaluation function, so that users are not surprised by errors coming out of evaluation.
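One way such an opt-in flag could be wired in is a small macro that expands to both loop variants and branches at run time. A sketch, assuming the flag is just a Bool (the name `@maybe_turbo` is for illustration only, not necessarily what the package uses):

using LoopVectorization

macro maybe_turbo(flag, loop)
    # Escape everything so that @turbo sees the plain loop expression
    # in the caller's scope:
    esc(quote
        if $flag
            LoopVectorization.@turbo $loop
        else
            @inbounds @simd $loop
        end
    end)
end

function square_kernel!(y, x; turbo::Bool=false)
    @maybe_turbo turbo for i in eachindex(y)
        y[i] = x[i] * x[i]
    end
    return y
end

x = randn(Float32, 1_000); y = similar(x);
square_kernel!(y, x; turbo=true)   # @turbo loop
square_kernel!(y, x; turbo=false)  # @inbounds @simd fallback

Since the branch is on a runtime Bool, both loop bodies are compiled either way; the flag only selects which one executes.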

MilesCranmer commented 1 year ago

Weird, I'm also seeing this for the derivative ops with Float32 variables. Just to be robust, maybe it's best I leave it optional for now, with a `turbo::Bool` kwarg to the evaluation functions.

Testing derivatives with respect to variables, with type=Float32.
ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context
Stacktrace:
  [1] _pow_grad_x(x::VectorizationBase.Vec{4, Float32}, p::VectorizationBase.Vec{4, Float32}, y::VectorizationBase.Vec{4, Float32})
    @ ChainRules ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:323
  [2] power_pullback
    @ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:201 [inlined]
 ...
 [20] #eval_diff_tree_array#33
    @ ~/Documents/DynamicExpressions.jl/src/EvaluateEquationDerivative.jl:49 [inlined]
 [21] #307
    @ ./none:0 [inlined]
 [22] iterate
    @ ./generator.jl:47 [inlined]
 [23] collect(itr::Base.Generator{UnitRange{Int64}, var"#307#313"{Matrix{Float32}, OperatorEnum{Tuple{typeof(+), typeof(*), typeof(-), typeof(/), typeof(pow_abs2)}, Tuple{typeof(custom_cos), typeof(exp), typeof(sin)}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(+)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(*)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(-)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(/)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(pow_abs2)}}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(custom_cos)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(exp)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(sin)}}}, Bool, Node{Float32}}})
    @ Base ./array.jl:787
 [24] top-level scope
    @ ~/Documents/DynamicExpressions.jl/test/test_derivatives.jl:85
 [25] include(fname::String)
    @ Base.MainInclude ./client.jl:476
 [26] top-level scope
    @ REPL[21]:1
in expression starting at /Users/mcranmer/Documents/DynamicExpressions.jl/test/test_derivatives.jl:45

MilesCranmer commented 1 year ago

Okay, with turbo set to optional, I can get the tests to pass. I'm seeing a really nice speed boost!!

using DynamicExpressions, LoopVectorization, BenchmarkTools

# Declare the operators before building an expression out of Nodes:
operators = OperatorEnum(; binary_operators=[+, -, *, /], unary_operators=[cos, sin])

X = randn(Float32, 3, 5_000);

# Feature nodes:
x1, x2 = Node(;feature=1), Node(;feature=2)
# Dynamically construct the expression:
tree = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)

# Evaluate:
@btime tree(X)
# 128 us
@btime tree(X; turbo=true)
# 57 us

Wow that is fast!

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 3331534317


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/Utils.jl | 3 | 10 | 30.0% |
| Total | 61 | 68 | 89.71% |

Totals Coverage Status:

- Change from base Build 3308712281: 0.6%
- Covered Lines: 894
- Relevant Lines: 967

💛 - Coveralls

MilesCranmer commented 1 year ago

Still not as fast as a handwritten kernel, though (the more complex the expression, the larger the advantage of a single fused kernel over the recursive per-node evaluation):

using LoopVectorization  # provides @turbo and indices

function f(X)
    y = Array{Float32}(undef, size(X, 2))
    @turbo for i in indices(y)
        x1 = X[1, i]
        x2 = X[2, i]
        y[i] = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)
    end
    y
end

@btime f(X)
# 5.2 us