RFC: Facilitate not paying double lookup cost when mutating a value in a LinearSlow AbstractArray #15630

Open KristofferC opened 8 years ago

KristofferC commented 8 years ago

There are many cases where one wants to modify a value in an array, like A[i, j] = f(A[i, j]). If A is a LinearSlow array, this currently requires two transformations between the Cartesian index and the linear index: one in getindex and one in setindex!.

A typical real-world example is in Finite Element Analysis, where one wants to "assemble" a small, dense local matrix Ke into a large sparse matrix K:

function assemble!(K, Ke, dofs)
    for (loc1, glob1) in enumerate(dofs)
        for (loc2, glob2) in enumerate(dofs)
            K[glob1, glob2] += Ke[loc1, loc2]
        end
    end
end

Since K[i,j] += Ke[a,b] lowers to K[i,j] = K[i,j] + Ke[a,b], we do the Cartesian-to-linear index search twice. To avoid this, Finite Element codes typically define an add(K, v, i::Integer, j::Integer) method, a sub(K, v, i::Integer, j::Integer) method, etc.; see for example https://github.com/dealii/dealii/blob/0cb18a1aba58177c59bfe3755d13e67cc6099244/examples/step-5/step-5.cc#L391.

It could perhaps be useful for all AbstractArrays to optionally support a fast version of A[i, j] = f(A[i, j]); let's call it mutateindex!.

mutateindex! would have the signature mutateindex!(A, f, i...), where f is a function that takes the current value of A[i...] and returns the new value.

A default fallback for mutateindex! would be simple:

function mutateindex!(A, f, i...)
    A[i...] = f(A[i...])
end

Optionally, an AbstractArray could implement a fast version of mutateindex!, like this one for a SparseMatrixCSC:

import Base.Order.Forward

function mutateindex!{T,Ti}(A::SparseMatrixCSC{T,Ti}, f, i0::Integer, i1::Integer)
    i0 = convert(Ti, i0)
    i1 = convert(Ti, i1)
    if !(1 <= i0 <= A.m && 1 <= i1 <= A.n); throw(BoundsError()); end
    # Binary search for row i0 among the stored entries of column i1
    r1 = Int(A.colptr[i1])
    r2 = Int(A.colptr[i1+1]-1)
    i = (r1 > r2) ? r1 : searchsortedfirst(A.rowval, i0, r1, r2, Forward)

    if (i <= r2) && (A.rowval[i] == i0)
        # Entry is already stored: update it in place, paying the search only once
        A.nzval[i] = f(A.nzval[i])
    else
        # Entry is not stored: insert f(zero(T)) and shift the later column pointers
        insert!(A.rowval, i, i0)
        insert!(A.nzval, i, f(zero(T)))
        @simd for j = (i1+1):(A.n+1)
            @inbounds A.colptr[j] += 1
        end
    end
    return A
end

We could now write our assembly function again as:

add!(A::AbstractMatrix, v, i0::Integer, i1::Integer) = mutateindex!(A, (i) -> i + v, i0, i1)

function assemble2!(K, Ke, dofs)
    for (loc1, glob1) in enumerate(dofs)
        for (loc2, glob2) in enumerate(dofs)
            add!(K, Ke[loc1, loc2], glob1, glob2) 
        end
    end
end

and by doing some benchmarks:

function benchmark(K, Ke, dofs)
    @time for i in 1:10^4
        assemble!(K, Ke, dofs)
    end

    @time for i in 1:10^4
        assemble2!(K, Ke, dofs)
    end
end

const Ke = rand(8,8)
const dofs = [20,21, 500, 501, 1001, 1002, 1086, 1087]
const K = sprand(10^5, 10^5, 0.001)
assemble!(K, Ke, dofs)

julia> benchmark(K, Ke, dofs)
  0.033032 seconds
  0.017603 seconds

we see that we are now about twice as fast due to halving the dominating lookup cost.

It would of course be nice to have some good syntax for this, which ties in to https://github.com/JuliaLang/julia/issues/249.

We could also just have setindex!(A, v, i...) be defined as mutateindex!(A, (_) -> v, i...), but that is probably too disruptive.

Maybe our AbstractArray Super Saiyans @timholy and @mbauman have comments on the feasibility of something like this.

mbauman commented 8 years ago

This isn't unique to arrays: https://github.com/JuliaLang/julia/issues/2108. The non-scalar case is also interesting, since A[:] += 1 allocates two temporary arrays (or one with views). If it lowered to setindex!(A, x->x+1, :), it wouldn't need to allocate any. I've not thought through the ramifications… but it's definitely interesting. Edit: of course, that won't work with an expression like A[:] += 1:length(A).
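
A hedged sketch of that non-scalar idea, reusing the mutateindex! name from above: updating through a view avoids materializing the indexed slice. (It does not cover a right-hand side that depends on position, like 1:length(A).)

function mutateindex!(A::AbstractArray, f, inds...)
    V = view(A, inds...)          # wraps the selected region without copying data
    for I in eachindex(V)
        @inbounds V[I] = f(V[I])  # read-modify-write each element in place
    end
    return A
end

# e.g. mutateindex!(A, x -> x + 1, :) instead of A[:] += 1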

timholy commented 8 years ago

It's definitely feasible, and obviously the scalar case is easier than the non-scalar case because you don't (usually?) have to worry about aliasing, which IIUC is the hardest part about automatic devectorization.

For types where getindex/setindex! get inlined, if I'm reading this right, LLVM already does this for us:

julia> B = rand(6,7);

julia> A = sub(B, 1:5, 1:5);

julia> foo!(A, i, j, v) = (@inbounds A[i,j] = A[i,j] + v; A)
foo! (generic function with 1 method)

julia> @code_native foo!(A, 2, 3, -20)
        .text
Filename: none
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        movq    $0, -32(%rbp)
        movq    $0, -24(%rbp)
        movq    $0, -16(%rbp)
        movq    $0, -8(%rbp)
        movq    $14, -72(%rbp)
        movabsq $jl_tls_states, %r8
        movq    (%r8), %rax
        movq    %rax, -64(%rbp)
        leaq    -72(%rbp), %rax
        movq    %rax, (%r8)
Source line: 145
        movq    %rdi, -56(%rbp)
        movq    24(%rdi), %r9
Source line: 419
        addq    8(%rdi), %rsi
        movq    %rdi, -48(%rbp)
        movq    (%rdi), %rax
        movq    %rax, -40(%rbp)
        leaq    -2(%rdx,%r9), %rdx
        imulq   24(%rax), %rdx
        addq    %rsi, %rdx
        movq    (%rax), %rsi
        vmovsd  -16(%rsi,%rdx,8), %xmm0 # xmm0 = mem[0],zero
Source line: 146
        vcvtsi2sdq      %rcx, %xmm0, %xmm1
        vaddsd  %xmm0, %xmm1, %xmm0
Source line: 178
        movq    %rdi, -32(%rbp)
Source line: 419
        movq    %rdi, -24(%rbp)
        movq    %rax, -16(%rbp)
        vmovsd  %xmm0, -16(%rsi,%rdx,8)
Source line: 179
        movq    %rdi, -8(%rbp)
        movq    -64(%rbp), %rax
        movq    %rax, (%r8)
        movq    %rdi, %rax
        popq    %rbp
        retq
        nopl    (%rax)

Obviously, for some array types (SparseMatrixCSC is indeed a great example) this doesn't happen.

Another potential way to achieve the same thing is through #15434 and whatever #15459 evolves into: if each array gets a custom iterator that provides trivial (i.e., inline-worthy) access to the data, then access cost is greatly reduced. My main concern is in figuring out how to do #15459 without uglifying everything. Maybe we can continue this conversation in a couple of days? I'm hoping to finish up a julep/blog post/whatever that goes more completely into my thoughts (which unfortunately are still vague and somewhat short when it comes to actual solutions).

mbauman commented 8 years ago

Changing the lowering of A[] op= v actually seems pretty promising to me. It could lower to updateindex!(A, op, v, i...), which could have a default fallback to setindex!(A, op(A[i...], v), i...). That'd be semantically identical to the status quo, just with a hook for types to allow specialization. It doesn't solve the dict get-or-set use case, but that strikes me as substantially different since the RHS isn't always evaluated.
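
As a rough sketch of what that hook could look like (updateindex! is not an existing Base function, just the name used above):

# Generic fallback: semantically identical to what `A[i...] op= v` does today
updateindex!(A::AbstractArray, op, v, i...) = (A[i...] = op(A[i...], v); A)

# Under the proposed lowering, `K[g1, g2] += Ke[l1, l2]` would become
#     updateindex!(K, +, Ke[l1, l2], g1, g2)
# and a type like SparseMatrixCSC could then add a single-lookup method.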

Another alternative would be to have an idiom for converting arbitrary integer indices into efficient pre-looked-up objects that support indexing. It'd be like a conversion utility to go from tuples of ints to the fast iterators. And that's akin to https://github.com/JuliaLang/julia/issues/12157.

eschnett commented 8 years ago

updateindex! opens the door to having collections where a[...] += b has a different meaning from a[...] = a[...] + b. For example -- just wildly speculating here -- one could create a new version of a nullable type where += automatically fills in missing values, using 0 as the fallback value for += and 1 as the fallback value for *=.
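
To make that speculation concrete, a minimal sketch (fallback and updateindex! are hypothetical names, and Union{Missing,T} is used purely for illustration):

# The operation's identity element is used when the stored value is missing
fallback(::typeof(+), ::Type{T}) where {T} = zero(T)
fallback(::typeof(*), ::Type{T}) where {T} = one(T)

function updateindex!(a::Vector{Union{Missing,T}}, op, v, i::Integer) where {T}
    x = a[i]
    a[i] = op(x === missing ? fallback(op, T) : x, v)
    return a
end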

A natural extension to updateindex! would be update! for non-array types, so that one can implement a more efficient version of += e.g. for a new matrix type that might not be an alias for Array.

KristofferC commented 8 years ago

The discussion in #249 seemed to reach the conclusion that different behaviors for x = x + a and x += a would be confusing. It does have precedent in, for example, Python though.

However, I believe there should be some way to efficiently update an index. The only way I currently know of is to copy-paste the setindex! implementation and modify it.

KristofferC commented 8 years ago

This button...

mbauman commented 8 years ago

Yeah, let's leave the scalar x += y case in #249. Indexed mutation is very different because it's explicitly mutating. It occurs to me, though, that changing the lowering would mean that the expression A[i,j] += 1 would no longer automatically return its RHS for things like chained assignments. I suppose you could also state that updateindex! must return its RHS, but then that defeats the non-scalar optimization. It's messier than I had hoped.

You could still pre-compute a fast index, which for SparseMatrices could be an immutable that keeps track of both the cartesian indices and the index into rowval/nzval, preventing the double searchsorted:

I = fastindex(K, glob1, glob2)
K[I] += Ke[loc1, loc2]
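
A rough sketch of such a pre-looked-up index for SparseMatrixCSC, assuming the (i, j) entry is already stored (SparseIndex and fastindex are made-up names for illustration):

using SparseArrays

struct SparseIndex
    i::Int   # row
    j::Int   # column
    k::Int   # position into rowval/nzval, found once by fastindex
end

function fastindex(A::SparseMatrixCSC, i::Integer, j::Integer)
    r1 = Int(A.colptr[j])
    r2 = Int(A.colptr[j+1] - 1)
    k = r1 > r2 ? r1 : searchsortedfirst(A.rowval, i, r1, r2, Base.Order.Forward)
    (k <= r2 && A.rowval[k] == i) || throw(ArgumentError("entry ($i, $j) is not stored"))
    return SparseIndex(Int(i), Int(j), k)
end

# Indexing with a SparseIndex touches nzval directly, skipping the binary search
Base.getindex(A::SparseMatrixCSC, I::SparseIndex) = A.nzval[I.k]
Base.setindex!(A::SparseMatrixCSC, v, I::SparseIndex) = (A.nzval[I.k] = v; A)
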
KristofferC commented 8 years ago

I've been pondering this a bit more while dogfooding my own array package, https://github.com/KristofferC/BlockArrays.jl, and came to the conclusion that being able to dispatch on += to update an array would in my view be very useful, and a fastindex index with O(1) lookup into a LinearSlow array is not enough.

As an example, let's say I have a 2x2 block matrix in the format of something I call a PseudoBlockArray, where the underlying array is just wrapped and stored contiguously.

julia> using BlockArrays

julia> A = PseudoBlockArray(zeros(4,4), [2,2], [2,2])
2×2-blocked 4×4 BlockArrays.PseudoBlockArray{Float64,2,Array{Float64,2}}:
 0.0  0.0  │  0.0  0.0
 0.0  0.0  │  0.0  0.0
 ----------┼----------
 0.0  0.0  │  0.0  0.0
 0.0  0.0  │  0.0  0.0

Now, I want to add something to the top-right block:

julia> A[Block(1,2)] += ones(2,2); A
2×2-blocked 4×4 BlockArrays.PseudoBlockArray{Float64,2,Array{Float64,2}}:
 0.0  0.0  │  1.0  1.0
 0.0  0.0  │  1.0  1.0
 ----------┼----------
 0.0  0.0  │  0.0  0.0
 0.0  0.0  │  0.0  0.0

The problem here is that A[Block(1,2)] allocates, since it returns a new copy of the block! The fastindex index does not solve this. Instead, I want to dispatch on the += and update the top-right block in place. Right now I just write an updateblock! function, but it would be much nicer if I could dispatch to an updating function.
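
Purely as illustration, here is roughly what dispatching on the update could look like, reusing the updateindex! name from above; blockrange is a made-up helper returning the global index ranges covered by a block:

using BlockArrays: PseudoBlockArray, Block

function updateindex!(A::PseudoBlockArray, op, v::AbstractMatrix, b::Block)
    rows, cols = blockrange(A, b)   # hypothetical: global row/column ranges of the block
    for (jv, j) in enumerate(cols), (iv, i) in enumerate(rows)
        A[i, j] = op(A[i, j], v[iv, jv])
    end
    return A
end

With such a hook, A[Block(1,2)] += ones(2,2) could update the block in place instead of building A[Block(1,2)] + ones(2,2) and assigning the copy back.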

KristofferC commented 8 years ago

I made a small package here that uses a macro to achieve what I want: https://github.com/KristofferC/UpdateIndex.jl

fredrikekre commented 6 years ago

This would be nice to have. I am not sure whether this would have to be breaking, and/or whether something needs to be done pre-1.0 in order to implement it afterwards. Personally, I think that https://github.com/JuliaLang/julia/issues/15630#issuecomment-201855260 would be a neat solution.

j-fu commented 4 years ago

Well, I would like to bump this - it really is a performance issue, in particular for 3D FE computations, and we see custom solutions spreading over various packages, see e.g. @KristofferC 's and mine ...

KristofferC commented 4 years ago

it really is a performance issue in particular for 3D FE computations

FWIW, for FE assembly I found it in general faster to just sort the degrees of freedom that you are going to assemble and then, for each column in the sparse matrix, pass through the rows linearly until you have matched the correct number of times:

https://github.com/KristofferC/JuAFEM.jl/blob/5867e06faef8e8c86076c176f30357f8526c107d/src/assembler.jl#L122-L141

This could be slower if you have elements with a huge number of degrees of freedom, but for my cases it has been quite fast compared to doing the binary search over and over.
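
For reference, a rough sketch of that "sort once, sweep each column" idea (not the linked JuAFEM code; it assumes every (dof, dof) position is already present in K's sparsity pattern, so the inner while loop always finds its row):

using SparseArrays

function assemble_sorted!(K::SparseMatrixCSC, Ke::AbstractMatrix, dofs::AbstractVector{<:Integer})
    perm   = sortperm(dofs)      # sort the element dofs once
    sorted = dofs[perm]
    for (loc2, glob2) in enumerate(dofs)         # columns can be visited in any order
        r = Int(K.colptr[glob2])                 # cursor into this column's stored rows
        for (k, glob1) in enumerate(sorted)      # rows visited in increasing order
            while K.rowval[r] < glob1            # single forward sweep, no binary search
                r += 1
            end
            K.nzval[r] += Ke[perm[k], loc2]
        end
    end
    return K
end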

j-fu commented 4 years ago

Yeah I am quite sure this is faster.

OTOH this means you need to separate pattern creation and assembly. A priori pattern creation becomes quite messy for strongly coupled nonlinear PDEs, so I wanted to have a matrix struct where just saying a[i,j] += v performs reasonably well. In fact, I used this idea before in C++.

But your remark triggers some thinking....

KristofferC commented 4 years ago

OTOH this means you need to separate pattern creation and assembly

Yes, that is a good observation.

oscardssmith commented 4 years ago

Is there a significant issue with the updateindex! suggestion made by @mbauman? It looks clean to me, and while it absolutely could lead to unintuitive behavior, from what I've seen the general sentiment in the Julia community has been to allow but discourage doing stupid things.

tkf commented 4 years ago

I think the Referenceables.jl API solves this problem. It's something like this:

using Referenceables: referenceable

# Convert an Array-of-T `A` to an Array-of-Ref{T} `B`
B = referenceable(A)

# Get Ref{T} by indexing into `B`.  The conversion to fast "intrinsic"
# indexing can be done at this point.
ref = @inbounds B[i, j]

# Mutating `ref` would mutate `B[i, j]` without paying the
# Cartesian-to-linear conversion cost.  (Bonus: no `@inbounds`)
ref[] += inc

As an extra benefit, this idea works very well with parallel foreach: https://tkf.github.io/ThreadsX.jl/dev/#ThreadsX.foreach
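
For comparison with the original example, the assembly loop could look roughly like this; whether the lookup cost is actually paid only once depends on the array type specializing the reference:

using Referenceables: referenceable

function assemble_ref!(K, Ke, dofs)
    R = referenceable(K)
    for (loc2, glob2) in enumerate(dofs), (loc1, glob1) in enumerate(dofs)
        r = R[glob1, glob2]      # one indexing step producing a reference
        r[] += Ke[loc1, loc2]    # read and write through the reference
    end
    return K
end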

KristofferC commented 4 years ago

I think Referenceables.jl API solves this problem.

Ref https://github.com/JuliaLang/julia/issues/24454

tkf commented 4 years ago

Yes, the ref thing is pretty much the token API.