CarloLucibello / GraphNeuralNetworks.jl

Graph Neural Networks in Julia
https://carlolucibello.github.io/GraphNeuralNetworks.jl/dev/
MIT License

aggregate_neighbors() is 100x slower than equivalent sparse matrix operation #124

Closed · learning-chip closed this issue 2 years ago

learning-chip commented 2 years ago

Although #106 has been solved by the fusion in #108, the slowness of the unfused implementation (apply_edges + aggregate_neighbors) was not clearly understood. Realistic GNN models contain mixed calls to message functions, reduce functions, and neural network layers, so they don't always exhibit the nice form that #108 requires.

Profiling with ProfileSVG.jl shows that 60% of time was spent on aggregate_neighbors:

[profiling flamegraph: prof_spmm]

Neighbor reduction with + is equivalent to either:

- a vector–matrix product e * A with a row vector of ones, or
- a column sum sum(A, dims=1).

Either way turns out to be more than 100x faster than aggregate_neighbors.

Reproducible example

using SparseArrays
using GraphNeuralNetworks
import Random: seed!
using BenchmarkTools

n = 1024
seed!(0)
A = sprand(n, n, 0.01)

g = GNNGraph(
    A,
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo
)

out = aggregate_neighbors(g, +, g.edata.A)
@btime aggregate_neighbors($g, +, $g.edata.A)  # ~4 ms

e = ones(1, n)  # row vector of ones
e * A == out  # true
@btime $e * $A  # ~20 μs

sum(A, dims=1) ≈ out  # true
@btime sum($A, dims=1)  # ~10 μs

This example uses only a single edge feature. Multiple edge features would correspond to a 3D sparse tensor, which is not supported by SparseArrays.jl -- TACO could be used in that case.

Package version

CarloLucibello commented 2 years ago

Nice! The comparison is not totally fair, because g.edata could contain an arbitrary 1 x num_edges feature array e, so one should first construct a matrix A out of e and then perform sum(A, dims=1). But even with this correction, I guess it is going to be much, much faster than what aggregate_neighbors currently does.
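A minimal sketch of that fairer comparison, where the COO indices `s`, `t` and the feature array `e` below are hypothetical stand-ins for the graph's actual edge data:

```julia
using SparseArrays

n = 1024
s = rand(1:n, 10_000)          # hypothetical source indices (COO)
t = rand(1:n, 10_000)          # hypothetical target indices (COO)
e = rand(1, length(s))         # arbitrary 1 x num_edges feature array

# Rebuild the sparse matrix from e first; duplicate (s, t) pairs are
# summed by default, which matches +-aggregation.
A = sparse(s, t, vec(e), n, n)
out = sum(A, dims=1)           # 1 x n aggregated node features
```

The cost of materializing `A` from `e` is what the raw `sum(A, dims=1)` benchmark above leaves out.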

aggregate_neighbors is just NNlib.scatter, so whatever fix we come up with should be implemented there.

I think we should introduce a fast-path scatter for the case op = + and size(e, 1) == 1 that does sum(A, dims=1). It's a bit annoying to have many special cases, but it is worth doing for relevant use cases until we have better generic solutions.
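As an illustration only (not the actual NNlib implementation), such a fast path for op = + with a single feature row could be sketched with `sparsevec`, which already sums duplicate indices with +:

```julia
using SparseArrays

# Hypothetical fast path: scatter-add of a 1 x num_edges array `e` into
# `n` destination slots given by `idx`. sparsevec accumulates repeated
# indices with +, which is exactly the reduction we want.
function scatter_add_1d(e::AbstractMatrix, idx::AbstractVector{<:Integer}, n::Integer)
    @assert size(e, 1) == 1
    reshape(Vector(sparsevec(idx, vec(e), n)), 1, n)
end

scatter_add_1d([1.0 2.0 3.0], [2, 2, 1], 3)  # -> [3.0 3.0 0.0]
```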

AriMKatz commented 2 years ago

> until we have better generic solutions.

What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

CarloLucibello commented 2 years ago

> What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

The first step would be to integrate https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl here for fast, parallelized sparse matrix operations. Then there is a lot of work to be done on sparse matrix support in CUDA.jl; there are a few open issues over there. Finally, maybe develop sparse arrays in arbitrary dimensions and corresponding operations (see TACO).

CarloLucibello commented 2 years ago

Solving this issue was actually quite easy:

@btime aggregate_neighbors(g, +, g.edata.A)
# 3.773 ms (52848 allocations: 3.39 MiB)  # on NNlib master
# 59.660 μs (3 allocations: 8.23 KiB)     # with https://github.com/FluxML/NNlib.jl/pull/384

CarloLucibello commented 2 years ago

Closing this, as the NNlib PR is being merged.