FluxML / GeometricFlux.jl

Geometric Deep Learning for Flux
https://fluxml.ai/GeometricFlux.jl/stable/
MIT License

move FeaturedGraph to COO format #200

Closed. CarloLucibello closed this issue 3 years ago.

CarloLucibello commented 3 years ago

Right now the FeaturedGraph type allows arbitrary underlying storage (e.g. adjacency matrix, adjacency list, ...). This is not ideal, since GNN layers have to perform additional computations (e.g. producing the adjacency list), and optimizing the GNN implementation for an arbitrary FeaturedGraph becomes hard.

I think we should restrict the FeaturedGraph type to store the graph in COO format, which should be well suited to GNN operations since it is what PyG and DGL use: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs https://docs.dgl.ai/guide/graph-graphs-nodes-edges.html#guide-graph-graphs-nodes-edges
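
For concreteness, a minimal sketch (my own illustration, not a proposed API) of what COO storage amounts to: the graph is just an edge list held as two index vectors, and other representations can be built from it on demand.

using SparseArrays

# Hypothetical example: the directed triangle 1→2→3→1 plus the reverse edges,
# stored as source/target index vectors.
s = [1, 2, 3, 2, 3, 1]          # source node of each edge
t = [2, 3, 1, 1, 2, 3]          # target node of each edge
n = 3                           # number of nodes

# e.g. an adjacency matrix with A[i, j] = 1 for every edge i → j
# can be recovered on demand:
A = sparse(s, t, ones(Float32, length(s)), n, n)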

yuehhua commented 3 years ago

For the message-passing scheme, an adjacency list or the COO format can be an ideal solution. However, the GCNConv and ChebConv layers require an adjacency matrix to compute the Laplacian matrix. The message-passing scheme needs indexing, while GCNConv and ChebConv need algebraic calculation. These are fundamentally different requirements. So I just keep the graph representation as given in the input and transform it inside each layer on demand.
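
For example, a minimal sketch of the algebraic side (my own illustration, not the GeometricFlux implementation): the normalized Laplacian that spectral layers build on is plain linear algebra over the adjacency matrix.

using LinearAlgebra

# Toy adjacency matrix of the path graph 1 - 2 - 3
A = Float32[0 1 0;
            1 0 1;
            0 1 0]

d = vec(sum(A, dims=2))             # node degrees
Dinvsqrt = Diagonal(d .^ -0.5f0)    # D^(-1/2)
L = I - Dinvsqrt * A * Dinvsqrt     # normalized Laplacian, the kind of quantity GCNConv/ChebConv need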

rayegun commented 3 years ago

Message passing is also (optionally?) done by GSPMM and GSDDMM in more recent DGL releases I think? At least that's how I've understood their recent papers.

yuehhua commented 3 years ago

Yes, the approach DGL takes is to use sparse matrices directly and rely on GSpMM or GSDDMM. But message passing in general is not fulfilled by matrix multiplication, which is an algebraic operation.

yuehhua commented 3 years ago

Since the COO format is just one kind of sparse array format/implementation, the point we should consider is supporting sparse arrays for computation rather than a specific format.

In the beginning, I considered using SparseMatrixCSC as the main graph representation for GeometricFlux. However, it didn't support AD and CUDA computation at that time. I benchmarked matrix multiplication over Array and SparseMatrixCSC; they showed similar efficiency, so I decided to use a generic dense array for the implementation.

Now SuiteSparseGraphBLAS.jl has emerged as a rising star for sparse arrays and has proven to have better performance than SparseMatrixCSC. Additionally, the State of Julia talk at JuliaCon noted that SuiteSparseGraphBLAS.jl could be a candidate for a better sparse array implementation. These points strengthen the case for using SuiteSparseGraphBLAS.jl as the graph representation for GeometricFlux, so I opened issue #213.

Technically, a sparse array should implement efficient algebraic operations, e.g. + and *, which fulfills the needs of spectral-based graph convolution. Further, the values of a sparse array are stored in one of the standard formats in a dense manner, which aligns them in memory really well. So it is also memory-efficient for spatial-based graph convolution to index neighbors for further computation. This, I think, is the ideal graph representation for both kinds of graph convolution.
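
A minimal sketch (using a plain SparseMatrixCSC purely for illustration, not the SuiteSparseGraphBLAS.jl representation) of how one sparse object can serve both needs:

using SparseArrays

# Adjacency matrix of the path graph 1 - 2 - 3 stored in CSC format.
A = sparse([1, 2, 2, 3], [2, 1, 3, 2], ones(Float32, 4), 3, 3)
X = rand(Float32, 3, 8)                      # node features, one row per node

# Algebraic side (spectral-based convolution): multiplication just works.
Y = A * X                                    # aggregates neighbor features

# Indexing side (spatial-based convolution): the CSC layout keeps each column's
# nonzeros contiguous in memory, so neighbor lookup is cheap.
neighbors_of_2 = rowvals(A)[nzrange(A, 2)]   # -> [1, 3]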

There is one more point. If we implement our own data structure, we have to maintain it ourselves. If we use sparse arrays from SparseArrays or SuiteSparseGraphBLAS.jl, there are maintainers who maintain the data structure for us. It would be better to leverage existing data structures.

yuehhua commented 3 years ago

Generalized Sparse Matrix Multiplication in DGL is really different from regular matrix multiplication. Regular matrix multiplication is C[i, j] = sum(A[i, k] * B[k, j] for k), so it needs a multiplication operator * and an aggregation operation sum to work with. Generalized Sparse Matrix Multiplication generalizes these two operations: the operator can be add/sub/mul/div and the aggregation can be sum/max/min/mean. Conceptually this is like mapreduce((x, y) -> x * y, +, A, B) in Julia. (I didn't check whether that syntax works, but it expresses my intention.) Notably, in their setting the matrix A holds node features and the matrix B holds edge features. This design is relatively limited and not straightforward to me.
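
To make the idea concrete, here is a small dense sketch (my own, not DGL's implementation) of matrix multiplication with the elementwise operator and the aggregation swapped out:

# C[i, j] = agg(op.(A[i, :], B[:, j])) for an arbitrary operator/aggregation pair.
function generalized_matmul(op, agg, A::AbstractMatrix, B::AbstractMatrix)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    return [agg(op.(A[i, :], B[:, j])) for i in 1:size(A, 1), j in 1:size(B, 2)]
end

A = rand(3, 4); B = rand(4, 2)
generalized_matmul(*, sum, A, B) ≈ A * B    # ordinary matrix multiplication
generalized_matmul(+, maximum, A, B)        # a (max, +) "tropical" variant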

CarloLucibello commented 3 years ago

It would be very good to have something like that. Unfortunately, it cannot be obtained with a mapreduce operation like mapreduce((x, y) -> x * y, +, A, B); it would require broadcast compatible A and B.

The advantage of GSpMM is that it performs a fused gather-message-scatter, without materializing the node features over the edges. The naive implementation in terms of gather and scatter would be (using #204):

function GSMM_v0(fg::FeaturedGraph, layer::MessagePassing, X, E, aggr)
    s, t = edge_index(fg)
    Xi = NNlib.gather(X, t)    # materializing an array of size num_features x num_edges   :(
    Xj = NNlib.gather(X, s)
    M = message(layer, Xi, Xj, E)    # compute messages on the edges
    M̄ = NNlib.scatter(aggr, M, t)    # aggregate messages at destination nodes
    return M̄
end

Maybe we can create a version of gather returning a view? Although I think the backpropagated sensitivities would still be materialized on each edge instead of directly accumulated, so probably not a good strategy.

rayegun commented 3 years ago

GraphBLAS is a strict generalization of DGL's GSPMM, at least when both A and B are matrices (DGL supports tensors, which could be achieved in GraphBLAS by either matricizing or using multiple spmm passes).

Arbitrary semirings can be used, and the ones used by DGL are built in (like the tropical semirings (max/min, +)).

The one issue I have right now is chain rules for those tropical semirings, since you need to store information for the reverse mode rule. I'll have one, but it won't necessarily be incredibly fast.

The other semirings will use native GraphBLAS operations and will be fast. I'm planning to make a draft PR next week.

DGL also has gSDDMM, which can be done in GraphBLAS, but won't fuse in the current implementation. It will in the future though, without any changes from our end.

Edit: note that the tensor issue is really only present for multiple features on edges. DGL uses a sparse matrix plus a tensor of edge features. In GraphBLAS this would probably be expressed as multiple sparse matrices, or as one sparse matrix with dense vector entries. Currently the former is fast and the latter is slow, because GraphBLAS is currently only fast for primitive types.

yuehhua commented 3 years ago

it would require broadcast compatible A and B.

Yes, it would surely require A and B to be broadcast compatible.

Maybe we can create a version of gather returning a view?

In my (previous) solution, it simply requires view(X, :, s). We don't need a separate gather anymore. It becomes:

_view(A::AbstractMatrix, idx) = view(A, :, idx)    # a view, no copy

function aggregate_neighbors(fg::FeaturedGraph, layer::MessagePassing, X, E, aggr)
    s, t = edge_index(fg)
    Xi = _view(X, t)                 # destination node features per edge (view)
    Xj = _view(X, s)                 # source node features per edge (view)
    M = message(layer, Xi, Xj, E)
    M̄ = NNlib.scatter(aggr, M, t)    # aggregate messages at destination nodes
    return M̄
end

This way, all we need is the scatter operation. DGL's GSpMM is just an alternative to scatter operations. We have already implemented various kinds of scatter operations in NNlib and NNlibCUDA, with support for both CPU and GPU. Why not just keep using them?
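
For reference, a toy example (made-up values) of what the scatter aggregation step does:

using NNlib

M = Float32[1 2 3 4;
            5 6 7 8]            # messages: 2 features × 4 edges
t = [1, 2, 1, 3]                # destination node of each edge

M̄ = NNlib.scatter(+, M, t)      # 2 × 3: column j sums the messages of all edges with t[k] == j
# M̄ == Float32[4 2 4;
#              12 6 8]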

yuehhua commented 3 years ago

According to DGL's documentation on GSpMM functions, Generalized Sparse-Matrix Dense-Matrix Multiplication fuses two steps into one kernel:

  1. Compute messages by add/sub/mul/div on source node and edge features, or copy node features to edges.
  2. Aggregate the messages by sum/max/min/mean as the features on destination nodes.

On the other hand, the message-passing scheme is defined in equation 1. It can be transformed into a code snippet such as:

sum(w -> message(h[:v], h[w], e[:v, w]), neighbors(:v))

The generalized sparse matrix multiplication approach restricts the message function to one of add/sub/mul/div; for example, F.u_mul_e_sum is equivalent to:

sum(w -> h[w] * e[:v, w], neighbors(:v))
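
The same operation can also be written with plain gather/scatter over a COO edge list; a hedged sketch (names and feature layout are my own assumptions):

using NNlib

# s, t: source/target node index per edge; H: node features (d × num_nodes);
# E: edge features (d × num_edges).
function u_mul_e_sum(s, t, H, E)
    Hj = NNlib.gather(H, s)          # source node features, one column per edge
    M  = Hj .* E                     # message: node feature * edge feature
    return NNlib.scatter(+, M, t)    # sum messages at each destination node
end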

However, message is, in general, a function. It can be an operator or even a neural network. Currently GeometricFlux does have the EdgeConv layer, which accepts a neural network as the message function. Thus, it cannot fit into the generalized sparse matrix multiplication approach.
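
As an illustration (a hedged sketch with made-up sizes, not the actual GeometricFlux code), an EdgeConv-style message applies a small MLP to concatenated node features, which has no single add/sub/mul/div equivalent:

using Flux

nn = Chain(Dense(6, 16, relu), Dense(16, 4))    # message network: 2 × 3 inputs -> 4 outputs

# xi: destination node features per edge, xj: source node features per edge
message_fn(xi, xj) = nn(vcat(xi, xj .- xi))

xi = rand(Float32, 3, 5)
xj = rand(Float32, 3, 5)
M = message_fn(xi, xj)                          # 4 × 5 messages, one column per edge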

Secondly, the aggregation function can be not only sum/max/min/mean but also concatenation. A semiring requires the aggregation function to be commutative, while concatenation is not: vcat(x, y) != vcat(y, x). Concatenation is required by the GATConv layer in GeometricFlux.

Overall, these are the two reasons why I cannot take the generalized sparse matrix multiplication approach as the option for the message-passing scheme. I don't rule out the possibility of it, but in reality it only supports some of the operations.

yuehhua commented 3 years ago

Closing this due to the merge of yuehhua/GraphSignals.jl#54.