CarloLucibello / GraphNeuralNetworks.jl

Graph Neural Networks in Julia
https://carlolucibello.github.io/GraphNeuralNetworks.jl/dev/
MIT License

aggregate_neighbors() is 100x slower than equivalent sparse matrix operation #124

Closed · learning-chip closed this issue 2 years ago

learning-chip commented 2 years ago

Although #106 has been solved by the fusion in #108, the slowness of the unfused implementation (apply_edges + aggregate_neighbors) was not clearly understood. Realistic GNN models contain mixed calls to message functions, reduce functions, and neural network layers, so they don't always exhibit the nice form that #108 requires.

Profiling with ProfileSVG.jl shows that 60% of time was spent on aggregate_neighbors:

[profiling flamegraph: prof_spmm]

Neighbor reduction with + is equivalent to either:

- a vector–matrix product e * A with a row vector of ones, or
- a column sum sum(A, dims=1).

Either way turns out to be more than 100x faster than aggregate_neighbors.

Reproducible example

using SparseArrays
using GraphNeuralNetworks
import Random: seed!
using BenchmarkTools

n = 1024
seed!(0)
A = sprand(n, n, 0.01)

g = GNNGraph(
    A,
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo
)

out = aggregate_neighbors(g, +, g.edata.A)
@btime aggregate_neighbors($g, +, $g.edata.A)  # ~4 ms

e = ones(1, n)  # row vector of ones
e * A == out  # true
@btime $e * $A  # ~20 μs

sum(A, dims=1) ≈ out  # true
@btime sum($A, dims=1)  # ~10 μs

This example uses only a single edge feature. Multiple edge features would correspond to a 3D sparse tensor, which is not supported by SparseArrays.jl -- TACO could be used in that case.

Package version

CarloLucibello commented 2 years ago

Nice! The comparison is not totally fair, because g.edata could contain an arbitrary 1 x num_edges feature array e, so one should first construct a matrix A out of e and then perform sum(A, dims=1). But even with this correction, I guess it is going to be much, much faster than what aggregate_neighbors currently does.
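A minimal sketch of that fairer comparison, where the COO indices `s`, `t` and the feature array `e` below are hypothetical stand-ins for the graph's actual edge data:

```julia
using SparseArrays

n = 1024
s = rand(1:n, 10_000)          # hypothetical source indices (COO)
t = rand(1:n, 10_000)          # hypothetical target indices (COO)
e = rand(1, length(s))         # arbitrary 1 x num_edges feature array

# Rebuild the sparse matrix from e first; duplicate (s, t) pairs are
# summed by default, which matches +-aggregation.
A = sparse(s, t, vec(e), n, n)
out = sum(A, dims=1)           # 1 x n aggregated node features
```

The cost of materializing `A` from `e` is what the raw `sum(A, dims=1)` benchmark above leaves out.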

aggregate_neighbors is just NNlib.scatter, so whatever fix we come up with should be implemented there.

I think we should introduce a fast-path scatter for the case op = + and size(e, 1) == 1 that does sum(A, dims=1). It's a bit annoying to have many special cases, but it is worth doing for relevant use cases until we have better generic solutions.
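As an illustration only (not the actual NNlib implementation), such a fast path for op = + with a single feature row could be sketched with `sparsevec`, which already sums duplicate indices with +:

```julia
using SparseArrays

# Hypothetical fast path: scatter-add of a 1 x num_edges array `e` into
# `n` destination slots given by `idx`. sparsevec accumulates repeated
# indices with +, which is exactly the reduction we want.
function scatter_add_1d(e::AbstractMatrix, idx::AbstractVector{<:Integer}, n::Integer)
    @assert size(e, 1) == 1
    reshape(Vector(sparsevec(idx, vec(e), n)), 1, n)
end

scatter_add_1d([1.0 2.0 3.0], [2, 2, 1], 3)  # -> [3.0 3.0 0.0]
```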

AriMKatz commented 2 years ago

> until we have better generic solutions.

What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

CarloLucibello commented 2 years ago

> What would those look like and at what layer in the stack? Should we open up some issues in GPUArrays or CUDA.jl? Or something in the #arrayIR slack channel?

The first step would be to integrate https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl here for fast, parallelized sparse matrix operations. Then there is a lot of work to be done on sparse matrix support in CUDA.jl; there are a few open issues over there. Finally, maybe develop sparse arrays in arbitrary dimensions and corresponding operations (see TACO).

CarloLucibello commented 2 years ago

Solving this issue was actually quite easy:

@btime aggregate_neighbors(g, +, g.edata.A)
# 3.773 ms (52848 allocations: 3.39 MiB)  # on NNlib master
# 59.660 μs (3 allocations: 8.23 KiB)     # with https://github.com/FluxML/NNlib.jl/pull/384

CarloLucibello commented 2 years ago

Closing this, as the NNlib PR is being merged.