JuliaLinearAlgebra / Octavian.jl

Multi-threaded BLAS-like library that provides pure Julia matrix multiplication
https://julialinearalgebra.github.io/Octavian.jl/stable/

Generalize matmul to arbitrary + and * #102

Open rayegun opened 3 years ago

rayegun commented 3 years ago

This request is a little out there, so it'd be totally reasonable to close it. TLDR and Tullio thoughts at the end.

My JSOC is to wrap and provide ChainRules for SuiteSparse:GraphBLAS, but I think Julia is a perfect language for a native implementation of its own (and the original Julia authors seem to agree; an early paper of theirs is effectively a simple GraphBLAS). GraphBLAS is a sparse linear algebra library where the arithmetic semiring (+, *) is replaced with an arbitrary one like (max, +) for instance.
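To make that concrete, here's a minimal, unoptimized sketch of a dense matmul with the semiring made explicit (the function name and keywords are just for illustration, not GraphBLAS or Octavian API):

# Naive dense matmul over an arbitrary semiring (add, mul); illustration only.
function semiring_matmul(A::AbstractMatrix, B::AbstractMatrix; add = +, mul = *, init = zero(eltype(A)))
    M, K = size(A); K2, N = size(B)
    @assert K == K2
    C = fill(init, M, N)
    for n in 1:N, k in 1:K, m in 1:M
        C[m, n] = add(C[m, n], mul(A[m, k], B[k, n]))
    end
    return C
end

# Ordinary product:         semiring_matmul(A, B)
# Tropical (max, +) product: semiring_matmul(A, B; add = max, mul = +, init = typemin(eltype(A)))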

So a stretch goal for the JSOC and/or a future personal goal is a Julia native GraphBLAS. Most of that work would be in a new sparse matrix library, but dense operations would be necessary as well of course.

Reusing Octavian, or a future complete JuliaBLAS, would be quite desirable for the dense side of things, so that brings me to two questions:

  1. Is there a performance penalty to arbitrary user provided binary operations? I think the answer here is not much. I've done some basic tests with this:
using LoopVectorization   # provides @tturbo and indices

function _matmul!(
    y::AbstractVector{T}, A::AbstractMatrix, x::AbstractVector, α, β, MKN, contig_axis;
    mul = *
) where {T}
    # MKN and contig_axis are unused in this snippet.
    @tturbo for m ∈ indices((A, y), 1)
        yₘ = zero(T)
        for n ∈ indices((A, x), (2, 1))
            yₘ += mul(A[m, n], x[n])
        end
        y[m] = α * yₘ + β * y[m]
    end
    return y
end

and performance is pretty much exactly the same as the existing implementation for mul = *, and it remains pretty fast for other operators. Now obviously this is not even a full implementation for mxv: I'd need to figure out how to please LoopVectorization with add = <arbitrary binaryop>; yₘ = add(yₘ, mul(A[m,n], x[n])), and mxm is a whole can of worms. But if I can expect little to no performance penalty, I'd be happy to open it.

  2. Is this in scope? My hope is yes. It would avoid forks, etc., or some rather unpleasant workarounds like defining new Number subtypes with different + and * ops (that's one option the Julia authors took in their paper). It would also make Julia's linear algebra extraordinarily flexible. I can find plenty of papers citing the usefulness of this flexibility, but the question is really whether there's a significant population of current or potential Julia users who would take advantage of it.

Some binary operators, those that short-circuit for instance, would beg for their own implementations, but that's easily handled, and the fallback would still be faster than almost any other library.

If this sounds like a good idea, and someone can give me a pointer on fixing the Reduction not found. error for yₘ = add(yₘ, mul(A[m,n], x[n])), then I can probably figure out the rest.
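Concretely, the fully generalized mxv kernel I have in mind looks roughly like the sketch below (the function name and init keyword are just illustrative); it's exactly the add(yₘ, ...) line that LoopVectorization currently rejects:

using LoopVectorization   # provides @tturbo and indices

# Sketch only: with an arbitrary `add`, LoopVectorization cannot infer the reduction's
# identity element, so this currently fails with "Reduction not found."
function _matmul_semiring!(
    y::AbstractVector{T}, A::AbstractMatrix, x::AbstractVector;
    add = +, mul = *, init = zero(T)
) where {T}
    @tturbo for m ∈ indices((A, y), 1)
        yₘ = init
        for n ∈ indices((A, x), (2, 1))
            yₘ = add(yₘ, mul(A[m, n], x[n]))
        end
        y[m] = yₘ   # α/β scaling dropped here, since it assumes the (+, *) semiring
    end
    return y
end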

E: Now that I've looked into it, I believe the dense version of GraphBLAS is simply a subset of Tullio. That makes things a lot simpler for the dense side, especially if there's a potential for Tullio matching Octavian in performance eventually. Currently Tullio is quite a bit slower than SuiteSparse:GraphBLAS for the equivalent operation, but I imagine there's room for optimization.
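For instance, the tropical (max, +) product is already a one-liner in Tullio, since it accepts a custom reduction operator in parentheses (a quick sketch):

using Tullio

A = rand(64, 64); B = rand(64, 64)

# Tropical (max, +) "matrix product": combine with +, reduce with max.
@tullio (max) C[m, n] := A[m, k] + B[k, n]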

If Tullio performance is predicted to approach BLAS then I can just use Tullio. If it's not then I'd like to try "generalized-Octavian".

TLDR: Is it performant and desirable to replace instances of _ * _ with mul(_,_) and _ += _ with _ = add(_, _) to support arbitrary mul and add? Or should this be done in Tullio, hopefully with some perf improvements?

chriselrod commented 3 years ago

GraphBLAS is a sparse linear algebra library where the arithmetic semiring (+, *) is replaced with an arbitrary one like (max, +) for instance.

Have you seen https://github.com/TensorBFS/TropicalGEMM.jl ?

So a stretch goal for the JSOC and/or a future personal goal is a Julia native GraphBLAS. Most of that work would be in a new sparse matrix library, but dense operations would be necessary as well of course. [...] and mxm is a whole can of worms.

What size dense operations? For "mxm" (matrix * matrix), it takes a while before Octavian starts beating a simple LoopVectorization.@(t)turbo triple-nested loop. The size depends on which arguments, if any, are transposed, the L2 cache size per core, and the number of cores you're running on.
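For reference, the "simple triple-nested loop" baseline is something like this sketch (the function name here is just illustrative):

using LoopVectorization   # provides @tturbo and indices

# The simple @tturbo triple-nested loop baseline (no blocking, no packing).
function matmul_naive!(C, A, B)
    @tturbo for n ∈ indices((B, C), 2), m ∈ indices((A, C), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A, B), (2, 1))
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end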

I'd need to figure out how to please LoopVectorization with add = <arbitrary binaryop>; yₘ = add(yₘ, mul(A[m,n], x[n])) [...] If this sounds like a good idea, and someone can give me a pointer on fixing the Reduction not found. error for yₘ = add(yₘ, mul(A[m,n], x[n])), then I can probably figure out the rest.

That'd require a LoopVectorization PR. LoopVectorization needs to know what the "0" is with respect to a reduction in order to be able to generate code in different ways without changing the answer. It only supports a few reductions, in particular: +, -, *, max, min.

@turbo works by creating a call to a generated function. This generated function uses type information to find out important things, such as which functions were passed as arguments; e.g., that mul === *.[1]

With that in mind, options are

  1. Relax the restriction on supported reductions, by allowing LoopVectorization to skip the transforms that require knowing what the zero is with respect to the reduction whenever that zero is unknown.
  2. Delay the checks for what the zero is until inside the @generated function. Currently it's required at macro-expansion time, at which point it can't yet know whether add === +, add === max, or add === fubar.

Both could of course be implemented together. If you'd be interested in working on something like that, I'd be happy to provide guidance on what needs to be done and answer any questions.

Is this in scope?

Yes, I think that's a great use case where a JuliaBLAS library can really shine. Would be good to generalize that here, and simplify the TropicalGEMM implementation to just a reinterpret + passing mul and add arguments.
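That is, something along the lines of the hypothetical call below; the add/mul keywords are the proposed generalization, not current Octavian API:

using Octavian

A = rand(200, 200); B = rand(200, 200); C = similar(A)

# Today: matmul! is hard-wired to the (+, *) semiring.
Octavian.matmul!(C, A, B)

# Proposed (hypothetical keywords, not yet implemented): the tropical (max, +) product
# over the same Float64 storage, so TropicalGEMM would reduce to a reinterpret plus this.
# Octavian.matmul!(C, A, B; add = max, mul = +)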

If Tullio performance is predicted to approach BLAS then I can just use Tullio. If it's not then I'd like to try "generalized-Octavian".

That's a question for @mcabbott. Reasons why Octavian is faster than Tullio:

  1. Octavian uses a blocking algorithm based on the cache sizes.
  2. Octavian uses ThreadingUtilities for its threads. It should switch to using Polyester when I or someone else gets around to updating it (Polyester is built on top of ThreadingUtilities, but lets scheduling avoid conflicts with other libraries that also use Polyester).
  3. Octavian will pack arrays when it is beneficial. It helps when some arguments are transposed (especially A in A*B), when the arguments are large, and (if A is not transposed) when stride(A,2)*sizeof(eltype(A)) is not a multiple of your SIMD register size.

[1] This should answer your question on performance implications: if LoopVectorization can recognize the function (i.e., it's included in this cost table), there should be no real difference between passing mul as an argument vs writing it directly.
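One way to sanity-check that on the mxv kernel from the first post (a sketch using BenchmarkTools; exact numbers will depend on hardware):

using BenchmarkTools

A = rand(512, 512); x = rand(512); y = similar(x)

# The default mul = * and an explicitly passed * should generate identical code,
# since the kernel specializes on the function argument.
@btime _matmul!($y, $A, $x, 1.0, 0.0, nothing, nothing)
@btime _matmul!($y, $A, $x, 1.0, 0.0, nothing, nothing; mul = *)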

rayegun commented 3 years ago

Have you seen https://github.com/TensorBFS/TropicalGEMM.jl ?

No I had not, that is quite interesting. I would, however, like to take the alternate route of generalizing operations, rather than eltypes. It is much more flexible imho, and allows the use of the arithmetic, tropical, etc. semirings on the same underlying array.

What size dense operations?

Arbitrary. The end goal being a sparse and dense generalized linear algebra lib, fast when it's achievable, like here!

If you'd be interested in working on something like that, I'd be happy to provide guidance on what needs to be done and answer any questions.

Definitely! If we can get any commutative monoid in for add, and any monoid in for mul, that covers everything, I think (some finicky but useful things, like positional rather than value-based operators, would require other changes to Octavian, but not LoopVec, I don't think). Just point me in the right direction.

Would be good to generalize that here, and simplify the TropicalGEMM implementation to just a reinterpret + passing mul and add arguments.

Yes hopefully everything they do is just a special case. I do have some thoughts about sparse tensors I'd like to talk with you about at some point in a similar vein.

Octavian will pack arrays when it is beneficial

Does packing occur with dense*dense? I'm still wrapping my head around it.

In any case, like I said just point me in the right direction whenever you have a chance and I'll try to work up both those extensions to LoopVec, and then the (relatively simple) changes to Octavian.

chriselrod commented 3 years ago

No I had not, that is quite interesting. I would, however, like to take the alternate route of generalizing operations, rather than eltypes. It is much more flexible imho, and allows the use of the arithmetic, tropical, etc. semirings on the same underlying array.

I don't want to make any promises about a time frame, but a goal is to eventually split LoopVectorization into pieces, namely a frontend, scheduler, and backend.

A frontend using the AbstractInterpreter could then theoretically be written and see that, for a given type, * is overloaded to be +(::Float64,::Float64), etc., so everything would come more or less for free when defining a type. This is still a long way off, but type-based approaches should at least become easy in the future.

Arbitrary. The end goal being a sparse and dense generalized linear algebra lib, fast when it's achievable, like here!

Okay. If it were small sizes only, then we could've avoided some of the complexity.

Definitely! If we can get any commutative monoid in for add, and any monoid in for mul, that covers everything, I think (some finicky but useful things, like positional rather than value-based operators, would require other changes to Octavian, but not LoopVec, I don't think). Just point me in the right direction.

Here is where it errors.

The first thing to note there is that that code really should be cleaned up. The second is that instrclass is actually only used on one side of the following if add_reduct_instruct branch; reduct_zero = reduction_zero(instrclass) is also only used on that branch, so a lot of that code could be moved there.

So, it looks like getting it to stop erroring at macro-expansion time might be all that it takes.

For the record, what that code is doing is translating

for i in I
    xi = x[i]
    for j in J
       xi += f(args...)
    end
    x[i] = xi
end    

into something like

for i in I
    xi = 0
    for j in J
       xi += f(args...)
    end
    x[i] = xi + x[i]
end    

The purpose of this change is so that x[i] only has to be in our cache once instead of twice.
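The same transform with add = max makes it clear why the identity element has to be known; both the initializer and the final combine depend on it (illustration only):

for i in I
    xi = typemin(eltype(x))   # the "zero" of max, not literal 0
    for j in J
        xi = max(xi, f(args...))
    end
    x[i] = max(xi, x[i])      # the final combine must also use max
end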

I'd start with cleaning up that code so reduction_instruction_class only gets called if add_reduct_instruct is actually true, and then see if there are any more errors. Or I could probably do that myself in just a couple of minutes; your call.

This won't add support for "any commutative monoid for add", but it would add it for all of them in this dictionary.

Something that'd involve a bigger change, but that I think would be a more flexible and easily extended approach, is to have it use something like this:

reduction_zero(::typeof(+), ::Type{T}) where {T} = zero(T)
reduction_zero(::typeof(max), ::Type{T}) where {T} = typemin(T)

for initializing reduction zeros. Then we could punt things to users. If they want to use a new function foo for a reduction, it'd then be relatively straightforward to define reduction_zero(::typeof(foo), ::Type{T}) appropriately.
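So a user wanting to reduce with a new function foo would only need something like this (hypothetical sketch; reduction_zero is not currently a user-facing extension point):

import LoopVectorization: reduction_zero   # hypothetical extension point

foo(a, b) = max(a, b)   # stand-in for a user's associative reduction

# The identity element of `foo`: reducing it with any x gives x back.
reduction_zero(::typeof(foo), ::Type{T}) where {T} = typemin(T)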

You can use the unused instruction field of the initializing Operation:

julia> M = K = N = 96; A = rand(M,K); B = rand(K,N); C1 = @time(A*B); C0 = similar(C1);
  0.000060 seconds (2 allocations: 72.078 KiB)

julia> ls = let C = C0
       LoopVectorization.@turbo_debug for n ∈ indices((B,C),2), m ∈ indices((A,C),1)
           Cmn = zero(eltype(C))
           for k ∈ indices((A,B),(2,1))
               Cmn += A[m,k] * B[k,n]
           end
           C[m,n] = Cmn
       end; end;

julia> ls.operations
6-element Vector{LoopVectorization.Operation}:
 var"##op#435" = var"###zero###13###"
 var"##op#436" = A[m, k]
 var"##op#437" = B[k, n]
 var"##op#435" = LoopVectorization.vfmadd_fast(var"##op#436", var"##op#437", var"##op#435")
 var"##op#435" = LoopVectorization.identity(var"##op#435")
 C[m, n] = var"##op#435"

julia> ls.operations[1]
var"##op#435" = var"###zero###13###"

julia> ls.operations[1].instruction
LoopVectorization.Instruction(:numericconstant, Symbol("###zero###13###"))

to store relevant information. For example, make it LoopVectorization.Instruction(:reduction_zero, :foo), and then have lower_constant! check whether the instruction's first field is :reduction_zero, and if so, instead of getting the instrclass as it does now:

    call = if reducedchildvectorized && vloopsym ∉ loopdependencies(op)
      instrclass = getparentsreductzero(ls, op)
      if instrclass == ADDITIVE_IN_REDUCTIONS
        Expr(:call, vecbasefunc(:addscalar), Expr(:call, lv(:vzero), VECTORWIDTHSYMBOL, ELTYPESYMBOL), constsym)
      elseif instrclass == MULTIPLICATIVE_IN_REDUCTIONS
        Expr(:call, vecbasefunc(:mulscalar), Expr(:call, lv(:vbroadcast), VECTORWIDTHSYMBOL, Expr(:call, :one, ELTYPESYMBOL)), constsym)
      elseif instrclass == MAX
        Expr(:call, vecbasefunc(:maxscalar), Expr(:call, lv(:vbroadcast), VECTORWIDTHSYMBOL, Expr(:call, :typemin, ELTYPESYMBOL)), constsym)
      elseif instrclass == MIN
        Expr(:call, vecbasefunc(:minscalar), Expr(:call, lv(:vbroadcast), VECTORWIDTHSYMBOL, Expr(:call, :typemax, ELTYPESYMBOL)), constsym)
      else
        throw("Reductions of type $(reduction_zero(reinstrclass)) not yet supported; please file an issue as a reminder to take care of this.")
      end

you'd create a call to Expr(:call, lv(:reduction_zero), instr.instr, ELTYPESYMBOL) instead of the calls like Expr(:call, :typemax, ELTYPESYMBOL).

You'd have to check all the uses of these functions you're getting rid of and apply a similar strategy:

> git grep -nr 'reduction_instruction_class' src
src/codegen/lower_compute.jl:384:  reduct_class = reduction_instruction_class(oppt.instruction)
src/codegen/lower_constant.jl:90:            return reduction_instruction_class(instruction(opp))
src/codegen/lower_store.jl:19:    reduction_to_scalar(reduction_instruction_class(instruction(opp)))
src/codegen/lowering.jl:505:    z = outer_reduction_zero(op, u₁u, Umax, reduction_instruction_class(instruction(op)), rs)
src/codegen/lowering.jl:628:    reduct_class = reduction_instruction_class(instruction(op))
src/modeling/costs.jl:342:reduction_instruction_class(instr::Symbol) = get(REDUCTION_CLASS, instr, NaN)
src/modeling/costs.jl:343:reduction_instruction_class(instr::Instruction) = reduction_instruction_class(instr.instr)
src/modeling/costs.jl:361:reduction_to_single_vector(x) = reduction_to_single_vector(reduction_instruction_class(x))
src/modeling/costs.jl:379:reduce_to_onevecunroll(x) = reduce_to_onevecunroll(reduction_instruction_class(x))
src/modeling/costs.jl:397:reduce_number_of_vectors(x) = reduce_number_of_vectors(reduction_instruction_class(x))
src/modeling/costs.jl:402:# reduction_to_scalar(x) = reduction_to_scalar(reduction_instruction_class(x))
src/modeling/costs.jl:420:reduction_to_scalar(x) = reduction_to_scalar(reduction_instruction_class(x))
src/modeling/costs.jl:439:reduction_scalar_combine(x) = reduction_scalar_combine(reduction_instruction_class(x))
src/modeling/costs.jl:444:# reduction_combine_to(x) = reduction_combine_to(reduction_instruction_class(x))
src/modeling/costs.jl:480:reduction_zero(x) = reduction_zero(reduction_instruction_class(x))
src/parse/add_compute.jl:190:    instrclass = reduction_instruction_class(instruction(vparents[2])) # key allows for faster lookups
src/parse/add_compute.jl:192:    instrclass = reduction_instruction_class(instr) # key allows for faster lookups

We could set up a call if that'd help.