Broadcast and linear indexing

chethega commented 5 years ago

It would be nice if we could propagate IndexStyle for broadcasts. The issue is the following:

julia> using BenchmarkTools
julia> n=10_000; d=1; T=Int32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  2.221 μs (0 allocations: 0 bytes)
  39.333 μs (0 allocations: 0 bytes)

julia> n=10_000; d=2; T=Float32; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  4.531 μs (0 allocations: 0 bytes)
  35.019 μs (0 allocations: 0 bytes)

julia> n=10_000; d=8; T=Float64; a=rand(T, d*n); b=rand(T, d, n); ac=copy(a); bc=copy(b); inc=Ref(T(3));
julia> @btime broadcast!(+, ac, a, inc); @btime broadcast!(+, bc, b, inc);
  59.652 μs (0 allocations: 0 bytes)
  88.321 μs (0 allocations: 0 bytes)

Currently, broadcasts always use cartesian indexing. This is slow and prevents a lot of simd.

In the relatively common case that dest and all args support linear indexing, and the only cases of dropped dimensions are zero-dimensional (as above), we should use linear indexing for a significant speedup (up to 20x), especially if the first dimension is small (which is a very common occurence).

mbauman commented 5 years ago

Broadcast supports non-AbstractArrays, so we can't ask all arguments for their IndexStyle. We also definitely cannot do this if any dimensions (other than 0) get "extruded". But that said, I think we could conservatively and statically identify more linear-capable broadcasts that are limited to known good cases. It's "as easy" as adding more eachindex methods.

At the same time, we should also add support for more array-like eachindex method support so folks can explicitly request IndexLinear or IndexCartesian if they need.

chethega commented 5 years ago

We could specialize on <:AbstractArray, <:Ref, <:Number, etc. A 20x speedup for <:AbstractArray without nonzero extruded dimensions is imho good enough to justify some code duplication.

I am half of a mind to propose IndexStyle(::Any) = IndexSequential() (or maybe IndexUnknown()), indicating that random access is bad or unsupported, and the iteration protocol should be used. Then, we could always query index styles of objects, and optimize a lot of unrelated code: Just like iterators indicate whether they have a known length and eltype, they would signal whether they like random access or need to carry more state in iteration.

It's "as easy" as adding more eachindex methods.

Ok, maybe I'm just blind, but how?

mbauman commented 5 years ago

Oh nevermind, I'm wrong. This can't be done type-stably because the decision to extrude is not in the type domain but on runtime values (sizes).

Thus we cannot have a type stable IndexStyle for broadcasts that returns IndexLinear for "safe" n-dimensional broadcasts. It'd have to be a runtime switch in a particular copyto! implementation... and we already are generating a number of for loops there for different optimizations. In fact, the addition of an extra for loop was precisely what stalled JuliaLang/julia#30973 — it significantly regressed compile times.

mbauman commented 5 years ago

Just to be clear (because obviously I forgot what restrictions we had here myself):

Broadcasting evaluates each element with bc[I]
If any extrusion occurs, broadcast absolutely needs to have the indices separated out into independent dimensions to remove/add/replace the extruded dimension as required
Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays — and that computation requires cartesian indices.

There is still a case where this would be type-stable: it's where there is only one non-zero-dimensional argument and it's IndexLinear and all other arguments are zero-dimensional. In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).

chethega commented 5 years ago

Extrusion is determined by run-time values. Broadcasting A .+ B where A and B are both Matrixes isn't safe to do linearly because it might need to swap out one or both of the indices in one or both of the arrays — and that computation requires cartesian indices.

In other words, it would apply to none of your original examples (since the RHS could be extruded into the LHS).

In fact, the addition of an extra for loop was precisely what stalled JuliaLang/julia#30973 — it significantly regressed compile times.

Merde, this is bad. After reading up on that PR of yours, I don't know what to do about this. Thanks for the explanation!

In 2.0, we should consider making broadcast extrusion explicit (i.e. missing dims get silently extruded such that array .+ scalar continues to work, but A .+ B of matrices would fail when sizes don't coincide).

mbauman commented 5 years ago

I mean, it's a tradeoff. It would suck to lose the ability to do things like A .- mean(A, dims=1) or A ./ sum(A, dims=1).

It is something I've considered, but without the ability to encode singleton dimensions in the type system I think it's a non-starter. Doing this generally was something we talked about in JuliaLang/LinearAlgebra.jl#42 but dismissed as being far too complicated. The introduction of an orthogonal syntax like f.(A, ^B) or f.(A, ⟂B) could ameliorate some of the pain, but I still don't think it'd be worth the massive breakage.

maleadt commented 5 years ago

Too bad, I had hoped for something similar for GPU arrays since the run-time index calculations are pretty costly there too.

chethega commented 5 years ago

You're right, some of the use cases for the current extrusion behavior are compelling.

So I guess we could prepare a PoC branch that emits both linear and cartesian indexing, with a single hoisted runtime check. That would blow up codesize and compiler latency, but improve runtime perf in most cases. Having such a branch ready would allow us to make more informed decisions on the tradeoffs (code complexity, readability of compilation output, compiler latency, runtime performance), and may be useful for people who prioritize runtime over compile-time perf. Also, the tradeoff may change in the future (compiler latency could get better).

It may be useful to also offer a keyword arg to broadcast / broadcast! that makes extrusions explicit? Something like extrusions = Val(((false,false), (true, false), ())) for NxN .+ 1xN .+ scalar. Keyword args to broadcasts could also allow other fast operations that we currently cannot support, like broadcast(args...; mask = mask::Base.LogicalIndex) (such ops currently need to construct views that incur a large temporary).

JuliaLang / julia

Broadcast and linear indexing #32051