JuliaArrays / StaticArrays.jl

Statically sized arrays for Julia

Various functions don't have cutoffs for when to stop unrolling #439

Open · tkoolen opened this issue 6 years ago

tkoolen commented 6 years ago

From https://github.com/JuliaArrays/StaticArrays.jl/issues/430#issuecomment-395977617, e.g.:

julia> using StaticArrays

julia> SA = rand(SMatrix{50,50,Complex{Float64}});

julia> @time vecnorm(SA);
 83.789766 seconds (137.97 M allocations: 3.949 GiB, 1.25% gc time)

There are many more affected functions, for example map, mapreduce, and broadcast.

andyferris commented 6 years ago

For some historical context, the general rule of thumb has been: if the generated code is O(length(a)), then we always unroll (remember it takes O(length(a)) code just to define the tuple, which takes up O(length(a)) space on the stack, etc.). For things like matrix multiplication, where the code grows faster than that, we add cutoffs.
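To make the unrolled style concrete, here is a minimal sketch (hypothetical function name, not the actual StaticArrays code) of a fully unrolled reduction written as a @generated function; the emitted expression, and hence the work handed to the compiler, grows with N:

using StaticArrays

@generated function unrolled_sum(v::SVector{N,T}) where {N,T}
    ex = :(v[1])                # assumes N >= 1
    for i in 2:N
        ex = :($ex + v[$i])     # builds :(((v[1] + v[2]) + v[3]) + ...)
    end
    return ex
end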

When/if we start experimenting with allocation elimination in v1.x, these methods with O(length(a)) generated code can be replaced with O(1) code via loops and mutation, keeping unrolling for small length(a).

tkoolen commented 6 years ago

I think that, with the way _vecnorm is currently implemented, compile time does not scale linearly, though it could of course.

andyferris commented 6 years ago

I didn't see it. Are you referring to the recursive way the expression is constructed here? (Expr is not an isbitstype, and AFAIK interpolation just puts a pointer to the existing expression into a new small one at each iteration, so construction should be O(N), though with O(N) allocations and a "linked-list" overall structure.)

I guess the compiler itself might not be able to deal with the resulting method in linear time... but at least in theory it should be able to do so.

tkoolen commented 6 years ago

Ah, I guess you're right. I thought that

https://github.com/JuliaArrays/StaticArrays.jl/blob/7ddb8e4a02656d0816a53a6d819e1384923132e5/src/linalg.jl#L287-L289

would allocate a new Vector containing :+ and its (growing number of) arguments at each iteration (as the args field of an expression), but

expr = :(a + b)
:($expr + c)

becomes :((a + b) + c), not :(a + b + c) as I thought. Maybe the deep expression nesting is a problem for the compiler then?

andyferris commented 6 years ago

I'm not sure. The new lowered IR is linear (intermediate expressions become SSA assignments); I wonder how fast that transformation is. OTOH I think +(1,2,3,4,5,6,7,8) might be slower on v0.6. Even on v0.7, +(...) might take a while to compile (and potentially to execute).

tkoolen commented 6 years ago

Yeah, changing things so that it actually does become +(1,2,3,4,5,6,7,8) doesn't look like a good idea.

In any case, it seems to me that this should just be a call to mapreduce, after which this discussion still applies to _mapreduce:

https://github.com/JuliaArrays/StaticArrays.jl/blob/7ddb8e4a02656d0816a53a6d819e1384923132e5/src/mapreduce.jl#L92-L96

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler? Anyway, I'll try something like that later.
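For concreteness, here is a sketch of that accumulator style (hypothetical name, not the actual _mapreduce): the generated body rebinds a single variable once per element instead of building one deeply nested call:

using StaticArrays

@generated function acc_mapreduce(f, op, v::SVector{N,T}) where {N,T}
    exprs = [:(acc = op(acc, f(v[$i]))) for i in 2:N]
    quote
        acc = f(v[1])           # assumes N >= 1
        $(exprs...)             # one flat rebinding per element
        return acc
    end
end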

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?
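(For context, the scaling there guards against spurious overflow and underflow in the intermediate sum of squares; a quick illustration with plain LinearAlgebra:)

julia> using LinearAlgebra

julia> sqrt(sum(abs2, [1e300, 1e300]))  # naive formula: abs2 overflows to Inf
Inf

julia> norm([1e300, 1e300])             # rescaled computation stays finite
1.4142135623730951e300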

andyferris commented 6 years ago

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler?

Note that in SSA form, if we unroll that loop, the IR creates a new variable for every assignment. If the binding has the same name in your code, lowering will rename it (this has lots of advantages for the compiler; notably, variables can now change type without causing a type-inference failure!). So yes, it's the same after lowering.
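A quick way to observe this (illustrative; the exact printed IR varies by Julia version):

julia> f(t) = (s = t[1]; s = s + t[2]; s = s + t[3]; s);

julia> @code_typed f((1.0, 2.0, 3.0))   # each reassignment of s appears as a fresh SSA value (%1, %2, ...)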

Loops are different and tend to use something called phi nodes, where bindings get "replaced" in every iteration of the loop. We'd have to experiment to find out when a loop is preferable. (Some people would argue we should always use loops and let the compiler / LLVM choose when to unroll.)

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?

Possibly? For my geometry work, I'm not sure it would be worth the overhead, but it really depends on the use case. This is one of the trickiest parts of StaticArrays: deciding how much precision to keep versus how much speed to gain. Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast. We could switch to precise by default and start using the @fastmath macros for speed, if people felt strongly about it.

tkoolen commented 6 years ago

Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast

Yep, sounds good.

if people felt strongly about it

I definitely don't.

andreasnoack commented 2 years ago

There have been various offline discussions of the latency issues with StaticArrays, and the conclusion is essentially this issue, i.e. that the unrolling makes the code so big that it can be very slow for LLVM to consume. As shown in https://github.com/JuliaArrays/StaticArrays.jl/issues/631, the latency issues might be significant with Float64 elements, but the (manual) unrolling strategy really falls short for larger element types. In our application we mix StaticArrays with ForwardDiff, and that completely invalidates any heuristics used here for when to unroll. An example:

julia> using StaticArrays, ForwardDiff, LinearAlgebra  # lu comes from LinearAlgebra

julia> tnt = map(1:6) do n
           (
               n = n,
               ctime = let t = time()
                   x = randn(n^2)
                   v = ForwardDiff.hessian(x) do _x
                       _n = isqrt(length(_x))
                       A = SMatrix{_n,_n}(_x...)
                       y = @SVector randn(_n)
                       return y'*(lu(A'*A)\y)
                   end
                   time() - t
               end
           )
           end
6-element Vector{NamedTuple{(:n, :ctime), Tuple{Int64, Float64}}}:
 (n = 1, ctime = 0.9155991077423096)
 (n = 2, ctime = 1.306809902191162)
 (n = 3, ctime = 5.09375)
 (n = 4, ctime = 8.366412162780762)
 (n = 5, ctime = 23.909090995788574)
 (n = 6, ctime = 130.0966489315033)

(I have a suspicion that lu is one of the functions particularly prone to this)

Our application doesn't stop at 6 or at second-order derivatives, so we have to spend a lot of time on reducing chunk sizes. That might be unavoidable, but it would certainly help if the compiler had a chance to decide when to unroll. Hopefully it would be possible to completely drop the manual unrolling here (or at least limit it to methods restricted to Float32/Float64 elements) and instead switch to a loop-based approach via MArray. The compiler might not yet be ready for this, but I think it would be good to start exploring this and to give examples where the compiler produces inefficient code. We will probably be able to help with such a rewrite of the library if the approach is acceptable.

mateuszbaran commented 2 years ago

Improving that specific case wouldn't be too hard. Matrix multiplication already has fallbacks with different levels of unrolling: https://github.com/JuliaArrays/StaticArrays.jl/blob/8ca11f871321c57cce2bfe3b9163317fa108f3c0/src/matrix_multiply.jl#L130 so very likely it would be enough to change sa[1] there to something like 8*sa[1]/sizeof(Ta), etc., and to limit unrolling to isbits types.
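A sketch of such a size-aware cutoff (illustrative names and constants, not the actual StaticArrays internals):

# Scale the unrolling cutoff by element size, and never unroll non-isbits types.
const UNROLL_BYTE_LIMIT = 14 * 8  # e.g. up to 14 Float64-sized elements

function should_unroll(::Type{T}, len::Integer) where {T}
    isbitstype(T) || return false
    # ForwardDiff duals are isbits but large, so the byte budget
    # (rather than the isbits check) is what reins them in.
    return len * sizeof(T) <= UNROLL_BYTE_LIMIT
end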

Similarly, lu and matrix division could simply call the generic implementations for large matrices.

I could review such changes.

andyferris commented 2 years ago

Yes.

To be clear, the current generated code was only ever meant to be a first cut, suitable for things like the 3-vectors and 3×3 matrices I was using at the time. All the functions should have fallbacks like matrix multiplication does (which obviously scales worse, so it was more important to cover early on).

I haven't looked into what's possible now with mutating vs. non-mutating approaches on the current compiler. Hopefully, though, you can keep say 10 to 100 values (especially Union{Missing, T}s) on the stack and iterate over them (both reading and writing) and output an immutable vector with efficient codegen (small code, no extra allocations, use of SIMD, etc.). It's been quite straightforward since forever to do such things with MVector, but no one has put in the time. Personally, I'm really hoping someone with the interest and time can dig in and find the best modern approach (I know it won't be me).

andyferris commented 2 years ago

The one thing I wish Julia had to support this nicely is a mutable version of tuple. The escape analysis and codegen already work perfectly well; I'd just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you typically want to store your data "flat" in an array without extra indirection).

KristofferC commented 2 years ago

The escape analysis and codegen already work perfectly well; I'd just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you typically want to store your data "flat" in an array without extra indirection).

Converting to an MArray, mutating, and then converting back usually works well and does not allocate, as long as things get inlined properly.
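A minimal sketch of that round-trip pattern (hypothetical function name):

using StaticArrays

# Convert to a mutable MVector, fill it in a loop, and convert back to an
# immutable SVector; with proper inlining this typically does not heap-allocate.
@inline function cumsum_static(v::SVector{N,T}) where {N,T}
    m = MVector{N,T}(undef)
    acc = zero(T)
    @inbounds for i in 1:N
        acc += v[i]
        m[i] = acc
    end
    return SVector{N,T}(m)
end

For example, cumsum_static(SVector(1.0, 2.0, 3.0)) returns SVector(1.0, 3.0, 6.0).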

chriselrod commented 2 years ago

Except that there is currently a compiler issue where the conversions are not no-ops but actually involve separate (stack) allocations and a fully unrolled load/store between them.