JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Loop IR interface #230

Open · Roger-luo opened 3 years ago

Roger-luo commented 3 years ago

As far as I understand, LoopVectorization parses a Julia Expr into an internal IR, the LoopSet (or some other types in https://github.com/JuliaSIMD/LoopVectorization.jl/blob/f2e60386486074140c1eeb4fe5a87b2b330997b8/src/modeling/graphs.jl#L1).

I'm currently trying to simplify the kernel code generation in BQCESubroutine: https://github.com/QuantumBFS/BQCESubroutine.jl/tree/master/src/codegen by writing a Loop IR in Expronicon.jl: https://github.com/Roger-luo/Expronicon.jl

I'm wondering if there are docs for LoopVectorization, or interfaces, that would allow me to use LoopVectorization's internal IR directly (or let LoopVectorization share a Loop IR with other packages), so that I could provide it more information.

E.g., in BQCESubroutine, what it tries to do is generate subspace matrix multiplication routines for a matrix expression that may only contain sin, cos, and exp, so that I can simplify the kernel using a rewrite engine based on this assumption. Sometimes one can also put assumptions on the element types in their DSL without going through type inference.

But I currently need to do manual loop unrolling etc. after this simplification pass, so I'm thinking that exposing some interface to the internal IR might make the metaprogramming-generated loops faster?

chriselrod commented 3 years ago

Last year I demoed rewriting the IR into loops that perform a reverse pass. I think reviving that effort and adding support for some AD system would be another great use case.

I am slowly working on a rewrite that will address a few limitations in the current approach. It will:

  1. Allow representing dependencies across loop iterations, so that LoopVectorization can avoid incorrect reordering.
  2. Allow inner loops to have bounds that are affine functions of outer loops, e.g. triangular loops (see the sketch after this list).
  3. Allow for multiple loops to exist at the same level of the nest.
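For point 2, a toy illustration of a triangular nest that the current LoopSet cannot represent, because the inner bound depends on the outer induction variable:

function tri_sum(A::AbstractMatrix)
    s = zero(eltype(A))
    for n in axes(A, 2)
        for m in firstindex(A, 1):n   # inner bound is an affine function of `n`
            s += A[m, n]
        end
    end
    return s
end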

It'll also aim to split the current LoopSet into pieces related to their use, e.g. separating out the code for parsing and building a LoopSet, since much of this code will likely differ based on how you're building it. Similarly, different parts require different caches, and not all parts are needed at the same time. For example, the usual pipeline is:

  1. Parse an expression into a LoopSet. Note that in my benchmarks this is actually faster than Meta.lower on the Expr (a sketch follows this list).
  2. Convert to type representation
  3. The _avx_! @generated function then uses that type information to recreate the LoopSet.
  4. Convert into an Expr
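For reference, step 1 can be reproduced today through the internal constructor (undocumented and subject to change):

using LoopVectorization

# Parse a loop expression into the internal LoopSet IR without executing it.
q = :(for m in 1:M, n in 1:N
    C[m, n] = A[m, n] + B[m, n]
end)
ls = LoopVectorization.LoopSet(q)

# Ask the cost model which loop order/unrolling it would choose.
LoopVectorization.choose_order(ls)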

Other changes include a focus on performance, e.g. using Ints instead of Symbols internally. However, after the first couple of @avx functions are compiled, most of the time is normally not spent inside LoopVectorization itself, but in other parts of codegen compiling the created function. Therefore a bigger speedup may be possible if LoopVectorization were to emit lower-level code itself; maybe I should be looking at just using LLVM.jl. Things like dependencies between variables are lost in the LoopSet -> Expr transition, which the Julia compiler must then re-infer. I don't know where the bottlenecks are exactly, but as much of the time was in the Julia compiler, maybe LLVM.jl is the best long-term solution. The major difficulty of that approach is that I'm abusing multiple dispatch a lot to make generating code easier: by having types determine behavior, you can greatly simplify the codegen step.

There is a ton of code currently dedicated to the LoopSet -> Expr conversion. Some of this is really bad (i.e., how it assigns names to variables is a nightmare that I'll do differently for the new LoopSet, no matter which codegen approach I take). But I also put a lot of work into making it generate efficient code, e.g. the LoopStartStopManager managing whether arrays are indexed, or whether we instead increment pointers to the arrays in each loop iteration.
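A hand-written toy contrast of those two addressing strategies (not LoopVectorization's actual output, which goes through VectorizationBase's pointer types):

function sum_indexed(A::Vector{Float64})
    s = 0.0
    for i in eachindex(A)
        s += A[i]                    # address recomputed from the index each iteration
    end
    return s
end

function sum_pointer(A::Vector{Float64})
    s = 0.0
    GC.@preserve A begin
        p = pointer(A)
        for _ in 1:length(A)
            s += unsafe_load(p)      # dereference, then bump the pointer
            p += sizeof(Float64)
        end
    end
    return s
end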

The internal IR will probably go into a separate package, which has the advantage of being able to bump its own major version. These pieces were previously considered LoopVectorization internals, which made life harder for (my) libraries built atop it that use undocumented APIs.

Sometimes one can also put assumptions on the element types in their DSL without going through type inference.

Do you intend to skip the @generated function used for figuring out types?

Roger-luo commented 3 years ago

I am slowly working on a rewrite that will address a few limitations in the current approach. It will:

I'm wondering, would you consider generating the loop IR from inferred Julia SSA IR? This would give you type information to let you handle composite types. I think one can add custom IR into Julia using the abstract interpreter mechanism now (though you will need to implement your own lowering pass to LLVM). There is some early code in IRTools that generates loop code from the SSA IR's CFG: https://github.com/FluxML/IRTools.jl/blob/master/src/passes/relooper.jl

I think there is some overlap with KernelCompiler.jl if you are going in this direction; the implementation in BQCESubroutine is actually a poor man's version of it, dedicated to quantum circuit simulation.

I think this could be a good use case for actually supporting a loop IR in Julia's SSA IR, to enable better loop optimization.

Do you intend to skip the @generated function used for figuring out types?

Yes, for example, in BQCESubroutine, I can assume I will be only generating the code for Float32 and Float64.

The internal IR will probably go into a separate package, which has the advantage of being able to bump its own major version.

It would be nice if, by then, Expronicon could help on the IR side, so we could have some sort of unified interface for custom semantics in Julia with pattern-matching support. I'm trying to split the loop generation code out of BQCESubroutine at the moment, but it doesn't do as much optimization as LoopVectorization does.

chriselrod commented 3 years ago

I'm wondering, would you consider generating the loop IR from inferred Julia SSA IR? This would give you type information to let you handle composite types.

LoopVectorization already has type information because of the @generated approach, but the problem (for composite types) is knowing how to use it. The advantage of using Julia IR and AbstractInterpreter is that I think it should let us eliminate the composite types altogether. I.e., keep inlining and SROA-ing everything until all we're left with is operations on primitive types.
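A toy example of what I mean: with full inlining and scalar replacement, a loop over a composite type reduces to loops over its primitive fields.

# Hypothetical composite type; after inlining `+` and scalar-replacing the
# struct, the loop body is just two Float64 adds the optimizer can vectorize.
struct MyComplex
    re::Float64
    im::Float64
end
Base.:+(a::MyComplex, b::MyComplex) = MyComplex(a.re + b.re, a.im + b.im)

function accum(xs::Vector{MyComplex})
    s = MyComplex(0.0, 0.0)
    for x in xs
        s = s + x        # becomes s_re += x.re; s_im += x.im after SROA
    end
    return s
end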

I think one can add custom IR into Julia using the abstract interpreter mechanism now (though you will need to implement your own lowering pass to LLVM). There is some early code in IRTools that generates loop code from the SSA IR's CFG: https://github.com/FluxML/IRTools.jl/blob/master/src/passes/relooper.jl

I think there is some overlap with KernelCompiler.jl if you are going in this direction; the implementation in BQCESubroutine is actually a poor man's version of it, dedicated to quantum circuit simulation.

I think this could be a good use case for actually supporting a loop IR in Julia's SSA IR, to enable better loop optimization.

I confess that I'm still out of my element here. I'll have to take some time to look into it eventually; given plenty of other things to do (e.g., the new internal representation of loops), I've been procrastinating in the hope that the AbstractInterpreter ecosystem will become better documented, more stable, and gain existing examples I can look at to help get started.

Would a custom lowering pass to LLVM really be necessary? If you count all of VectorizationBase's infrastructure meant to support LoopVectorization's codegen, especially around memory addressing, then the large majority of code is just about codegen.

It can perform "peephole"-style optimizations by lowering code in certain ways, and then VectorizationBase can define specific overloads or generated functions that look for optimizable patterns. An entirely different approach will be necessary, but that's not necessarily a bad thing. For example, the same infrastructure for those optimizations could then perhaps also be used by cost modeling (rather than duplicating pieces looking for patterns, as is done now).

Yes, for example, in BQCESubroutine, I can assume I will be only generating the code for Float32 and Float64.

The @generated approach it takes now tells it exactly the element type and vector width to use. Float32 and Float64 can lead to different unrolling behavior: e.g., imagine a loop statically known to be of length 8 on a system with 256-bit vectors; that is a single 8-wide Float32 vector, but two 4-wide Float64 vectors. Plus it allows for extra information such as Matrix{T} vs Adjoint{T,Matrix{T}}: can you provide information about memory layout, or will it have to make possibly incorrect assumptions while optimizing?

(The AbstractInterpreter approach of course handles this.)
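For reference, the width query behind that arithmetic is available from VectorizationBase (the results shown assume a 256-bit/AVX2 machine):

using VectorizationBase: pick_vector_width

pick_vector_width(Float64)  # static(4): a static-length-8 loop is two full vectors
pick_vector_width(Float32)  # static(8): the same loop is a single full vector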

It would be nice if, by then, Expronicon could help on the IR side, so we could have some sort of unified interface for custom semantics in Julia with pattern-matching support. I'm trying to split the loop generation code out of BQCESubroutine at the moment, but it doesn't do as much optimization as LoopVectorization does.

I'd be happy to talk more about this / general design of a loop optimization ecosystem. Different components:

  1. Loop IR.
  2. Cost modeling and "transformations" on the Loop IR. In LoopVectorization's case, it doesn't actually transform, but generates more of a description/primitive schedule that tells codegen exactly how it should generate code from the IR. This schedule will get much more sophisticated with the rewrite.
  3. Actual generation of code that can somehow be run. Either a Julia Expr, Julia IR, or LLVM directly (via LLVM.jl?). This is dependent on both 1. and 2. -- how it works is dependent on both the IR, and the types of schedules received.
  4. Mechanism for creating typed loop IR. Currently LoopVectorization does this in one of two ways: (a) parse an Expr and convert it to type information passed to an @generated function, which then has access to all the type information of the arrays, e.g. PermutedDimsArray{Float32, 3, (2, 3, 1), (3, 1, 2), Array{Float32, 3}}; or (b) create the loops implied by Base.Broadcasted objects when broadcasting. In the future, we'd want to integrate this with AbstractInterpreter to handle composite types. This is independent of "2." and "3.", but of course depends on "1.".
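To make case (a) concrete, the layout information is carried entirely in the array's type parameters:

A = rand(Float32, 4, 5, 6)
P = PermutedDimsArray(A, (2, 3, 1))
typeof(P)  # PermutedDimsArray{Float32, 3, (2, 3, 1), (3, 1, 2), Array{Float32, 3}}

# The permutation and its inverse are type parameters, so an @generated
# function sees the memory layout without any runtime values.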

Where do you see Expronicon fitting in? Sounds like "1." and "3.", which would require "2." as well?

Roger-luo commented 3 years ago

So I'm splitting out two packages currently

  1. Expronicon for Julia Expr level manipulation which should be compatible with all 1.x versions
  2. Yuan for Julia SSA IR manipulation (that is CodeInfo and IRCode) which is only compatible with 1.6

If the Loop IR is defined in a style compatible with Julia's Expr (e.g., if it can be used to generate a Julia for-loop Expr), then it will be in Expronicon. For Yuan, I will be implementing a lowering pass to LLVM, since I need that for YaoCompiler too (and will split out some common tools for working with the abstract interpreter).

I have currently only split the loop IR I used in BQCESubroutine out to Expronicon. I'm still thinking about how to do it more generally, e.g. BQCESubroutine's IR can't handle threading or CUDA well at the moment.

So I think Expronicon is only useful if you need to manipulate Expr, but since you are using generated functions, that's probably not the case.

chriselrod commented 3 years ago

So I think Expronicon is only useful if you need to manipulate Expr, but since you are using generated functions, that's probably not the case.

Of course an @generated function needs an Expr in the end. But Expr and Expr-like data types aren't good for analyzing loops.

So LoopVectorization's needs may not have much overlap (meaning not much code-reuse potential) with codegen living inside Expronicon?

The future representation will consist of the following components:

  1. Loops will be represented by a set of "statements". Each statement corresponds to either a store (e.g. C[m,n] = x) or an update of a variable not initialized inside the loops (e.g. s += b, where there is no statement like s = 0.0 inside the loops). Each statement is a collection of operations (e.g., maybe x = w + c, where w = W[i], etc). The same operation may appear in multiple different statements.
  2. Dependencies between statements.
  3. Dependency graph between individual operations.
  4. Loop bounds.

The current LoopSet is missing "2.", and "1." is implicit rather than explicit.
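A minimal sketch of how those four pieces might fit together as plain data (all names here are hypothetical, not the actual design):

struct Operation                 # a single op, e.g. w = W[i] or x = w + c
    name::Symbol
    op::Symbol
    args::Vector{Int}            # operation-level dependency edges ("3.")
end

struct Statement                 # a store like C[m,n] = x, or an update like s += b
    ops::Vector{Int}             # the operations making up the statement ("1.")
    deps::Vector{Int}            # statement-level dependency edges ("2.")
end

struct Loop
    induction::Symbol
    lower::Int                   # would generalize to affine bounds ("4.")
    upper::Int
end

struct LoopNest
    loops::Vector{Loop}
    statements::Vector{Statement}
    operations::Vector{Operation}
end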

I'd like an eventual direct lowering to LLVM, so maybe some cooperation on the Yuan side of things could help.

Roger-luo commented 3 years ago

Of course an @generated function needs an Expr in the end. But Expr and Expr-like data types aren't good for analyzing loops.

Ah, right; when you said generated function, I thought you were planning to work on CodeInfos.

So LoopVectorization's needs may not have much overlap (meaning not much code-reuse potential) with codegen living inside Expronicon?

The goal of Expronicon is just to be a place to collect common metaprogramming code, and I'm planning to move the loop-related metaprogramming tools I wrote in BQCESubroutine into it, e.g. tools for hoisting constants out of the kernel: https://github.com/QuantumBFS/BQCESubroutine.jl/blob/master/src/utils.jl#L45 I guess this could also be useful for LoopVectorization, but it needs to be made more general.
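A toy before/after of that hoisting transform (not the actual implementation):

θ, x = 0.3, rand(8)
y = similar(x)

# before: the loop-invariant cos(θ) is recomputed on every iteration
for i in eachindex(x)
    y[i] = cos(θ) * x[i]
end

# after hoisting: the invariant is computed once, outside the kernel
c = cos(θ)
for i in eachindex(x)
    y[i] = c * x[i]
end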

I feel that things like LoopVectorization, or say YaoCompiler, KernelCompiler, etc., will eventually have to insert compiler plugins at every level of Julia compilation. That's why I'm splitting things out of the YaoCompiler tree: one reason is to get better testing, the other is to make them useful for others too.

Dependencies between statements.

FYI, if you want to analyze variable dependencies at the Expr level, https://github.com/JuliaStaging/JuliaVariables.jl might be useful.

Loops will be represented by a set of "statements". Each statement corresponds to either a store (e.g. C[m,n] = x) or an update of a variable not initialized inside the loops (e.g. s += b, where there is no statement like s = 0.0 inside the loops). Each statement is a collection of operations (e.g., maybe x = w + c, where w = W[i], etc). The same operation may appear in multiple different statements.

Do you mean performing duplicate-call elimination at the Julia Expr level? I tried to do it in BQCESubroutine for the tiny DSL I defined, but I don't know the best way to do it in the general case. Besides, LLVM does it, I think; it's just that for my special case I want to generate precisely the optimal code (see the sketch below).
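For illustration, a minimal Expr-level pass for one repeated call could look like this (hypothetical helper using MacroTools, not what BQCESubroutine does):

using MacroTools: postwalk

# Replace every occurrence of `call` in `ex` with `name`, binding it once up front.
function cse_call(ex::Expr, call::Expr, name::Symbol)
    body = postwalk(x -> x == call ? name : x, ex)
    return :($name = $call; $body)
end

cse_call(:(sin(θ) * a + sin(θ) * b), :(sin(θ)), :t)
# :(t = sin(θ); t * a + t * b)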

Tokazama commented 2 years ago

Is there any way we could represent loops as simple types that could be optimized by additional passes from downstream packages like LoopVectorization? ArrayInterface can break an expression like getindex(A, inds1, inds2) down to this...

getindex(A, inds1, inds2) = _getindex(buffer(A), StrideIndex(A), inds1, inds2)
function _getindex(buf, idx, inds1, inds2)
    dest = similar(buf, inds1, inds2)
    dest_inds = eachindex(dest)
    itr = iterate(dest_inds)
    for i2 in inds2
        for i1 in inds1
            # unpack the current destination index and the iterator state
            d, state = itr
            dest[d] = buf[idx[i1, i2]]
            itr = iterate(dest_inds, state)
        end
    end
    return dest
end

We want something to rewrite these loops taking into account idx (among other things), but once we write the code like this I think the only way to do that is with something like Cassette. If we had something more portable (perhaps like a Base.Generator produced when calling (buffer[x, y] for x in indices(buffer, 1), y in indices(buffer, 2))) then it would be easier to have lower level libraries write functions in a way that could later be optimized. I was thinking something like what is done in Transducers.jl and Base.Iterators where we iterate over a collection and have a fairly good idea of what the operation is. We could add an additional metadata field that provides info for the loop. So the above example could be something like...


struct EachIndex{B,I,M}
    buffer::B
    indices::I
    metadata::M
end

# attach layout metadata (here an ArrayIndex) by default
EachIndex(buffer, indices) = EachIndex(buffer, indices, ArrayIndex(buffer))

struct CopyTo!{D<:EachIndex,S<:EachIndex}
    destination::D
    source::S
end

function getindex(A, inds1, inds2)
    dest = similar(A, inds1, inds2)
    collect(CopyTo!(
      EachIndex(dest,  indices(dest)),
      EachIndex(A, Iterators.product(inds1, inds2))
    ))
end

So instead of going straight into a loop IR interface we would have something that hands off important traits/metadata/etc so the IR stuff is more straightforward.