CarloLucibello / GraphNeuralNetworks.jl

Graph Neural Networks in Julia
https://carlolucibello.github.io/GraphNeuralNetworks.jl/dev/

propagate() is 20x slower than built-in sparse matmul #106

Closed: learning-chip closed this issue 2 years ago

learning-chip commented 2 years ago

By the well-known graph-matrix duality (see the GraphBLAS intro, Fig. 1), simple graph message-passing kernels are equivalent to sparse matrix-vector multiplication (SpMV) or sparse matrix-matrix multiplication (SpMM). However, I notice that propagate() is more than 20x slower than the built-in B * A for an equivalent operation. I ran the same test with DGL and did not observe such a drastic slowdown.

To reproduce

using SparseArrays
using GraphNeuralNetworks
using BenchmarkTools
import Random: seed!

n = 1024
seed!(0)
A = sprand(n, n, 0.01)
b = rand(1, n)
B = rand(100, n)

g = GNNGraph(
    A,
    ndata=(; b=b, B=B),
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:coo  # changing to :sparse has little effect on performance
)

function spmv(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.b, e=g.edata.A
        )
end

function spmm(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.B, e=g.edata.A
        )
end

isequal(spmv(g),  b * A)  # true
@btime spmv(g)  # ~5 ms
@btime b * A  # ~32 us

isequal(spmm(g), B * A)  # true
@btime spmm(g)  # ~9 ms
@btime B * A  # ~400 us

Such a performance gap can't be explained by storing the sparse matrix in COO (the default in GNN libraries) vs. CSR (SciPy's default) vs. CSC (Julia's default). In the code below, changing the SciPy matrix format has only a minor effect on speed. Also, DGL and SciPy run at similar speeds.
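The same check can be made on the Julia side by rebuilding the graph with graph_type = :sparse, as already noted in the snippet above. A quick sketch (assuming the edge ordering, and hence edata, is preserved as in the :coo construction):

g_sparse = GNNGraph(
    A,
    ndata=(; b=b, B=B),
    edata=(; A=reshape(A.nzval, 1, :)),
    graph_type=:sparse  # CSC-backed adjacency instead of COO
)
@btime spmv(g_sparse)  # still in the ms range: the storage format alone doesn't close the gap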

Compare with DGL and SciPy

import numpy as np
import scipy.sparse as sp
import torch

import dgl
import dgl.function as fn

n = 1024

np.random.seed(0)
A = sp.random(n, n, density=0.01, format='csc')  # changing format to `coo` or `csr` affects performance, but not much
b = np.random.rand(1, n)
B = np.random.rand(100, n)

g = dgl.from_scipy(A)
g.edata['A'] = torch.tensor(A.data[:, np.newaxis])
g.ndata['b'] = torch.tensor(b.T)
g.ndata['B'] = torch.tensor(B.T)

def spmv(g):
    with g.local_scope():
        g.update_all(fn.e_mul_u('A', 'b', 'm'), fn.sum('m', 'bA'))
        return g.ndata['bA']

def spmm(g):
    with g.local_scope():
        g.update_all(fn.e_mul_u('A', 'B', 'M'), fn.sum('M', 'BA'))
        return g.ndata['BA']

np.array_equal(spmv(g).numpy().T, b @ A)  # True
%timeit spmv(g)  # ~200 us
%timeit b @ A  # ~70 us

np.array_equal(spmm(g).numpy().T, B @ A)  # True
%timeit spmm(g)  # ~900 us
%timeit B @ A  # ~900 us

Effect of fusion

DGL's update_all fuses the message and reduction kernels. To mimic the two-stage propagate and check whether fusion accounts for the performance difference:

def spmv_twostage(g):
    with g.local_scope():
        g.apply_edges(fn.e_mul_u('A', 'b', 'm'))
        g.update_all(
            fn.copy_e('m', 'm'),
            fn.sum('m', 'bA')
        )
        return g.ndata['bA']

%timeit spmv_twostage(g)  # ~240 us; just 20% slower

The unfused version is only about 20% slower, which cannot explain the 5 ms vs. 200 us performance gap.

There must be other sources of inefficiency. I'd like to figure them out and bring the performance at least close to DGL's. (I use DGL a lot, but some projects favor an all-Julia implementation, and your package seems a good option with clean syntax🙂)

Package version

Jl:

Py:

learning-chip commented 2 years ago

I also compared with PyG 2.0.1, which shows good performance with the single-channel feature b, but is much slower with the 100-channel feature matrix B.

An interesting observation is that PyG's SpMM scales almost linearly with the number of rows of B: multiplying a (100, n) matrix is a full 100x slower than multiplying a (1, n) vector. It feels as if SpMV is called 100 times, rather than a single SpMM kernel with better data locality being used.

(Recall that DGL performs well in both the SpMV and SpMM cases, and remains the best reference.)
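To make that hypothesis concrete, here is a rough Julia illustration using the A and B from the original post (an assumption about what "calling SpMV 100 times" would look like, not a claim about PyG's internals):

# One sparse product per row of B, mimicking the "100 SpMV calls" hypothesis.
spmm_rowwise(B, A) = reduce(vcat, [B[i:i, :] * A for i in 1:size(B, 1)])

@btime spmm_rowwise(B, A)  # expected to be several times slower than the single SpMM below
@btime B * A               # one SpMM call with better data locality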

Reproducer:

import numpy as np
import scipy.sparse as sp
import torch

from torch_geometric.data import Data
from torch_geometric.nn import MessagePassing

n = 1024

np.random.seed(0)
A = sp.random(n, n, density=0.01, format='coo')
b = np.random.rand(1, n)
B = np.random.rand(100, n)

edge_index = torch.tensor([A.row, A.col], dtype=torch.long)
data = Data(edge_index=edge_index, A=torch.tensor(A.data),
            B=torch.tensor(B), b=torch.tensor(b))

class SpMVLayer(MessagePassing):
    def __init__(self):
        super().__init__(aggr="add", node_dim=-1)

    def forward(self, x, edge_index, A):
        return self.propagate(edge_index, x=x, A=A)

    def message(self, x_j, A):
        m = x_j * A
        return m

spmv = SpMVLayer()

BA_gnn = spmv(data.B, data.edge_index, data.A)
np.array_equal(BA_gnn.numpy(), B @ A)  # True
%timeit spmv(data.B, data.edge_index, data.A)  # ~10ms, 10x slower than SciPy

bA_gnn = spmv(data.b, data.edge_index, data.A)
np.array_equal(bA_gnn.numpy(), b @ A)  # True
%timeit spmv(data.b, data.edge_index, data.A)  # ~100us, as fast as SciPy
CarloLucibello commented 2 years ago

Thanks, this benchmark is very useful. For this specific operation of the e_mul_xj type, we can implement a specialized fused propagate, similar to what we currently have for copy_xj. In fact, for copy_xj the performance is on par with matrix multiplication:


function spmm_copyxj_fused(g)
    propagate(
        copy_xj,
        g, +; xj=g.ndata.B
        )
end

function spmm_copyxj_unfused(g)
    propagate(
        (xi, xj, e) -> xj,
        g, +; xj=g.ndata.B
        )
end

Adj = map(x -> x > 0 ? 1 : 0, A)
@assert spmm_copyxj_unfused(g) ≈ B * Adj
@assert spmm_copyxj_fused(g) ≈ B * A # bug for weighted graphs fixed in #107

@btime spmm_copyxj_fused(g)  # 268.614 μs (22 allocations: 1.13 MiB)
@btime spmm_copyxj_unfused(g)  # 4.263 ms (52855 allocations: 12.23 MiB)
@btime B * Adj  # 196.135 μs (2 allocations: 800.05 KiB)
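
For intuition, here is a minimal sketch of what a fused e_mul_xj path amounts to when the aggregation is + (an illustration only, not the implementation that later landed in #107):

using SparseArrays

# Sketch: with + aggregation, e_mul_xj over edge weights w reduces to a single SpMM.
# Assumes node features X are stored as (channels, num_nodes), as in this thread.
function fused_e_mul_xj_sketch(g::GNNGraph, X, w)
    s, t = edge_index(g)                                 # source and target node ids per edge
    W = sparse(s, t, vec(w), g.num_nodes, g.num_nodes)   # weighted adjacency, W[s, t] = w
    return X * W                                         # out[:, i] = sum_j W[j, i] * X[:, j]
end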

We can get more generic fusion by integrating https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl, as discussed a bit with @Wimmerer.

As for why the unfused operations based on gather/scatter are so slow, and whether they can be improved, that still has to be analyzed.
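
For reference, the unfused path boils down to a gather, a per-edge multiply, and a scatter-add. A rough sketch of that pattern (an illustration, not the library's exact code):

using NNlib: scatter

# The per-edge message matrix M of size (channels, num_edges) is fully materialized,
# which costs allocations and memory traffic that a fused SpMM avoids.
function unfused_e_mul_xj_sketch(g::GNNGraph, X, w)
    s, t = edge_index(g)
    M = X[:, s] .* reshape(vec(w), 1, :)                        # gather source features, weight per edge
    return scatter(+, M, t; dstsize=(size(X, 1), g.num_nodes))  # scatter-add messages to target nodes
end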

learning-chip commented 2 years ago

We can get more generic fusion by integrating https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl, as discussed a bit with @Wimmerer.

Thanks, this seems like a very promising direction -- is there a dedicated issue to plan and track it? #19?

CarloLucibello commented 2 years ago

Yes, we could open a separate issue. I'm not yet familiar with SuiteSparseGraphBLAS, so I haven't thought about a proper integration plan. The first step would just be a drop-in sparse matrix replacement, as in #19, I guess.

rayegun commented 2 years ago

I'm beginning to work on this area again. The primary issue is still the missing rrules for the min/max reduction cases, but I'll make a PR without those by the end of the week.

Those missing rrules are possible, and I'm tackling that issue from 3-4 directions.

CarloLucibello commented 2 years ago

We don't necessarily have to tackle everything at once. We can dispatch just on + at the beginning and use gather/scatter for the other reductions.

CarloLucibello commented 2 years ago

With #107, the example in the OP now gives good results:

function spmm_unfused(g)
    propagate(
        (xi, xj, e) -> e .* xj ,  # same as e_mul_xj
        g, +; xj=g.ndata.B, e=g.edata.A
        )
end

function spmm_fused(g)
    propagate(
        e_mul_xj,
        g, +; xj=g.ndata.B, e=vec(g.edata.A)
        )
end

@assert isequal(spmm_fused(g), B * A)  # true
@assert isequal(spmm_unfused(g), B * A)  # true
@btime spmm_fused(g)  #   271.228 μs (33 allocations: 1.21 MiB)
@btime spmm_unfused(g)  # 4.324 ms (52855 allocations: 12.23 MiB)
@btime B * A  # 196.115 μs (2 allocations: 800.05 KiB)  
learning-chip commented 2 years ago

Confirmed the speed-up from fusion with #107. Thanks for such a quick response.

I also notice that the speed-up is gone when passing e=g.edata.A (a (1, num_edges) matrix) instead of e=vec(g.edata.A) (a 1D vector). This is understandable: multiple edge features represented by a dense (num_features, num_edges) matrix correspond to a 3D sparse tensor of shape (num_features, row, col), which goes beyond ordinary 2D sparse linear algebra.
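To make the comparison explicit (a sketch restating the observation above; only the vector form hits the fused path):

spmm_fused_matE(g) = propagate(e_mul_xj, g, +; xj=g.ndata.B, e=g.edata.A)       # (1, num_edges) matrix
spmm_fused_vecE(g) = propagate(e_mul_xj, g, +; xj=g.ndata.B, e=vec(g.edata.A))  # 1D vector

@btime spmm_fused_matE(g)  # falls back to the slow generic path, ms range
@btime spmm_fused_vecE(g)  # fused path, a few hundred us, comparable to B * A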

Nonetheless, the multiple-edge-feature case likely has a lot of room for performance optimization. A simple argument: we know the special case num_features=1 (without flattening to a 1D vector) is not optimized (it stays unfused), giving a 20x gap from optimal. The general num_features > 1 case should inherit the same source of inefficiency, compared either to a specialized sparse tensor library like TACO or simply to calling the built-in B * A SpMM kernel num_features times.
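For reference, the "one SpMM per channel" baseline could look like this (a hypothetical sketch, not library code; it assumes edge features E stored as (num_edge_features, num_edges)):

using SparseArrays

# One built-in SpMM per edge-feature channel, each against its own weighted adjacency.
function per_channel_spmm(g::GNNGraph, X, E)
    s, t = edge_index(g)
    n = g.num_nodes
    return [X * sparse(s, t, E[k, :], n, n) for k in 1:size(E, 1)]
end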

A 3D sparse tensor is outside standard GraphBLAS, so I doubt SuiteSparseGraphBLAS.jl works for this case?

CarloLucibello commented 2 years ago

For multiple edge features, one could use NNlib.batched_mul or Tullio.jl, but the main problem is that Julia doesn't have well-supported sparse tensor types AFAIK.
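
As a rough illustration of the batched_mul route (a sketch under the assumption that the per-channel weighted adjacencies have already been materialized as a dense 3D array, which is exactly the missing-sparse-tensor problem mentioned above):

using NNlib: batched_mul

# X: (channels, num_nodes) node features; W3: (num_nodes, num_nodes, num_edge_features) dense.
function multi_feature_batched(X, W3)
    Xb = repeat(reshape(X, size(X)..., 1), 1, 1, size(W3, 3))  # replicate X along the batch dim
    return batched_mul(Xb, W3)                                  # (channels, num_nodes, num_edge_features)
end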

rayegun commented 2 years ago

A 3D sparse tensor is outside standard GraphBLAS, so I doubt SuiteSparseGraphBLAS.jl works for this case?

No it doesn't. We would flatten when using GraphBLAS for that case.

specialized sparse tensor library like TACO

That's what having TACO generate Julia is for :), and that's pretty close!

For multiple edge features, one could use NNlib.batched_mul or Tullio.jl, but the main problem is that Julia doesn't have well-supported sparse tensor types AFAIK.

Yes, I think this will work fine for now. I'll have TACO.jl and associated packages to provide those eventually, hopefully in the next few months.

We can take care of speeding up num_features > 1 as that comes around.

My plan right now: SuiteSparse:GraphBLAS where it's possible, TACO everywhere else. SS:GrB will be faster in most cases (often significantly), but TACO will provide backup.

rayegun commented 2 years ago

Also, one other thing I forgot to mention: in the medium term (probably one to a few years), SuiteSparse:GraphBLAS will support fast user-defined datatypes. This makes it pretty easy to add support for multidimensional edge features by just using a tuple/vector element type.

I will be exposing the current functionality for this in SS:GrB shortly, although user-defined functions/datatypes are currently quite slow since they use pointers for everything.

So we will hopefully have two backends with different tradeoffs: TACO.jl and SS:GrB. Both should support the full set of required computations, with different speed tradeoffs depending on the computation.

learning-chip commented 2 years ago

So we will hopefully have two backends with different tradeoffs: TACO.jl and SS:GrB

I just noticed a promising new work, Compiler Support for Sparse Tensor Computations in MLIR. Tables 1-4 contain comparisons to TACO. Worth considering.

rayegun commented 2 years ago

Ooh, I'm glad that finally came out; Aart has been working on that dialect for a while.