DeepGraphLearning / torchdrug

A powerful and flexible machine learning platform for drug discovery
https://torchdrug.ai/
Apache License 2.0

[Problem] COO Sparse Tensor Multiplication Is Very Slow. #173

Open mrzzmrzz opened 1 year ago

mrzzmrzz commented 1 year ago

I found that sparse tensor multiplication is very slow in the GearNet module.

Here is the main code in message_and_aggregate:

adjacency = utils.sparse_coo_tensor(
  torch.stack([node_in, node_out]), graph.edge_weight,
  (graph.num_node, graph.num_node * graph.num_relation)
)

When I replaced the original COO sparse tensor with a CSR sparse tensor, the time to run GearNet for protein label prediction dropped by about 50%, e.g., from 16 minutes per epoch to 8 minutes per epoch on a single RTX 3090 (batch size 8).
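
For reference, here is a minimal, self-contained sketch of the kind of comparison I mean. It is not the actual TorchDrug code: the sizes num_node, num_relation, hidden_dim and num_edge are made-up placeholders, and the transposed relational adjacency is built directly so that no sparse transpose is needed.

import time

import torch

# Placeholder sizes standing in for the values GearNet would see.
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
num_node, num_relation, hidden_dim, num_edge = 2000, 7, 512, 40000

node_in = torch.randint(num_node, (num_edge,), device=device)
node_out = torch.randint(num_node * num_relation, (num_edge,), device=device)
edge_weight = torch.rand(num_edge, device=device)

# Same matrix in two layouts: COO first, then converted to CSR.
adjacency_t = torch.sparse_coo_tensor(
    torch.stack([node_out, node_in]), edge_weight,
    (num_node * num_relation, num_node)
).coalesce()
adjacency_t_csr = adjacency_t.to_sparse_csr()

x = torch.rand(num_node, hidden_dim, device=device)

def bench(matrix, dense, n=100):
    # Warm up, then time n sparse @ dense products.
    for _ in range(3):
        matrix @ dense
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n):
        matrix @ dense
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

print("COO:", bench(adjacency_t, x))
print("CSR:", bench(adjacency_t_csr, x))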

I'm not sure whether this slowdown is specific to my GPU or due to the sparse tensor layout. If it's the latter, maybe I can open a pull request for it.

KiddoZhu commented 1 year ago

That's a good catch! Do you know which PyTorch version you observed this speedup with?

CSR is more efficient for matrix multiplication, while COO is more efficient for editing sparse matrices. We were not confident about the coverage of CSR operations in PyTorch, so we fall back to COO everywhere. If CSR is well supported by PyTorch now, we will update TorchDrug accordingly. This would bring a huge acceleration to many GNN models.
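
To illustrate the tradeoff with a rough, standalone sketch (not TorchDrug code): COO makes structural edits cheap, while CSR's compressed row structure favors the sparse @ dense product.

import torch

indices = torch.tensor([[0, 1, 2], [1, 2, 0]])
coo = torch.sparse_coo_tensor(indices, torch.ones(3), (3, 3)).coalesce()

# Editing in COO: just add another sparse matrix holding the new entries.
extra = torch.sparse_coo_tensor(torch.tensor([[2], [1]]), torch.ones(1), (3, 3))
edited = (coo + extra).coalesce()

# Multiplying in CSR: convert once, then reuse the compressed row structure.
csr = edited.to_sparse_csr()
out = csr @ torch.rand(3, 4)
print(out.shape)  # torch.Size([3, 4])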

mrzzmrzz commented 1 year ago

My PyTorch version is 1.12.1 with CUDA 11.6. As far as I know, PyTorch supports the CSR format in recent versions.
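
A quick, generic way to check CSR availability on a given install (nothing TorchDrug-specific; CSR support has been labeled beta in PyTorch for a while):

import torch

print(torch.__version__)
csr = torch.eye(3).to_sparse().to_sparse_csr()
print(csr.layout)  # torch.sparse_csr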