Exscientia / physicsml

A package for all physics-based/related models
MIT License

CUDA graphs #33

Open peastman opened 1 month ago

peastman commented 1 month ago

Would it be possible to make PhysicsML support CUDA graphs? In some cases it can provide a large speed improvement.

The main requirements for graph compatibility are that tensor shapes must be static and that there is no dynamic control flow. These are also similar to the requirements for torch.export(), which will be an important mechanism to support once its API stabilizes.
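For example (illustrative sketch, not PhysicsML code), a data-dependent Python branch like this can't be captured, because evaluating the condition forces a device-to-host sync:

import torch

# Illustrative only: Tensor.__bool__ on the condition copies the tensor back
# to the host, which raises "operation not permitted when stream is capturing"
# if it happens inside graph capture.
def forward_bad(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:
        return x * 2
    return x.relu()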

wardhaddadin1 commented 1 month ago

Hello!

I had a look at this: it seems to require discarding most Python data types (dicts, etc.). Currently that is how we pass input data to models (it's a bit tidier than specifying everything explicitly).

I don't think there are restrictions or issues with doing this; it just requires some work. But from my understanding, CUDA graphs provide speed-ups only if a model is limited by kernel launch time, which most of our models are not (they are limited by tensor product operations; I think only ANI is launch-bound). I haven't done proper benchmarking though.
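A rough way to check (sketch, using CUDA events; time_callable is just an illustrative helper) would be to compare the eager time per call against the same callable after graph capture:

import torch

# Illustrative helper: average milliseconds per call of fn(arg) on the GPU.
def time_callable(fn, arg, iters: int = 100) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):  # warmup
        fn(arg)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(arg)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

If the graphed version isn't much faster, the model is compute-bound and capture wouldn't buy much.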

I'm not opposed to this, but I don't think I have the bandwidth to do it currently. If you (or anyone really) would like to submit a PR, I'm happy to review and go on from there!

Best, Ward

peastman commented 1 month ago

Where are you seeing that you can't use dicts? CUDA graphs are implemented at the level of the driver. All they care about is the sequence of CUDA kernels that get executed and the inputs that are passed to them. That's both a strength (they don't care what the Python code that launches the kernels does) and a weakness (they can't capture any of the logic from that Python code). See https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs.
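The basic flow from that post looks roughly like this (minimal sketch, not PhysicsML-specific):

import torch

# Capture records the kernel sequence once; replay reruns it against the
# same memory, so inputs are copied into static buffers of fixed shape.
layer = torch.nn.Linear(5, 12).to("cuda")
static_in = torch.randn(64, 5, device="cuda")

# Warm up on a side stream, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out = layer(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = layer(static_in)

# New data: fill the static input in place, then rerun the captured kernels.
static_in.copy_(torch.randn(64, 5, device="cuda"))
g.replay()
print(static_out)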

wardhaddadin1 commented 2 weeks ago

Hello!

Sorry for the radio silence. To start checking what needs to change in the code, you can try this script. It gets a batch from a featurised dataset (QM9), instantiates a model (EGNN), and then tries to make a CUDA graph. It currently fails at converting a tensor to an int (the number of graphs in the batch).

from molflux.datasets import load_dataset
from molflux.core import featurise_dataset
from molflux.modelzoo import load_from_dict as load_model_from_dict
import torch

# generate featurised dataset
dataset = load_dataset("gdb9", "rdkit").select(range(100))
featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": "mol_bytes",
            "representations": [
                {
                    "name": "physicsml_features",
                    "config": {
                        "atomic_number_mapping": {
                            1: 0,
                            6: 1,
                            7: 2,
                            8: 3,
                            9: 4,
                        },
                        "backend": "rdkit",
                    },
                    "as": "{feature_name}"
                }
            ]
        }
    ]
}

# featurise the mols
featurised_dataset = featurise_dataset(
    dataset,
    featurisation_metadata=featurisation_metadata,
    num_proc=4,
    batch_size=100,
)

# load and instantiate model
model_config = {
    "name": "egnn_model",
    "config": {
        "x_features": [
            'physicsml_atom_idxs',
            'physicsml_atom_numbers',
            'physicsml_coordinates',
        ],
        "y_features": [
            'homo'
        ],
        "num_node_feats": 5,
        "y_graph_scalars_loss_config": {
            "name": "MSELoss",
            "weight": 1.0,
        },
        "optimizer": {
            "name": "AdamW",
            "config": {
                "lr": 1e-3,
            }
        },
        "scheduler": None,
        "datamodule": {
            "y_graph_scalars": ['homo'],
            "num_elements": 5,
            "cut_off": 5.0,
            "train": {"batch_size": 64},
            "validation": {"batch_size": 128},
        },
        "trainer": {
            "max_epochs": 10,
            "accelerator": "cpu",
            "logger": False,
        },
        "compile": True
    }
}

model = load_model_from_dict(model_config)
model.module = model._instantiate_module()

# get example batch
batch_dict = next(iter(model._instantiate_datamodule(predict_data=featurised_dataset).predict_dataloader()))
batch_dict = model.module.graph_batch_to_batch_dict(batch_dict)

# move module and batch to cuda
model.module.to("cuda")
for k, v in batch_dict.items():
    batch_dict[k] = v.to("cuda")

# check inference
print(model.module(batch_dict))

# make cuda graph
mod = torch.cuda.make_graphed_callables(model.module, (batch_dict,), allow_unused_input=True)

# check inference
print(mod(batch_dict))

Let me know if this works for you!

Best, Ward

peastman commented 1 week ago

The first error we get is

  File "/home/peastman/miniconda3/envs/physicsml/lib/python3.11/site-packages/physicsml/models/egnn/egnn_utils.py", line 498, in forward
    dim_size=int(data["num_graphs"]),
             ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not permitted when stream is capturing

Apparently it doesn't allow converting a tensor to a Python int inside a graph. The int gets passed to scatter(), which really does require an int, not a tensor. What does the value represent? Could it be passed in some other way, or could it be omitted? That argument to scatter() is marked as optional; if it's omitted, it produces a minimal-sized output.

I tried hardcoding dim_size=1 to get past that error. It then fails later when attempting the backward pass:

Traceback (most recent call last):
  File "/home/peastman/miniconda3/envs/physicsml/lib/python3.11/site-packages/torch/cuda/graphs.py", line 377, in make_graphed_callables
    grad_inputs = torch.autograd.grad(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/peastman/miniconda3/envs/physicsml/lib/python3.11/site-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream

I'm not sure what's causing that. https://github.com/pytorch/pytorch/issues/15623 discusses problems in PyTorch that caused graph capture to fail for backward passes, but those were supposedly fixed a few years ago. I found lots of other reports of that error message, but none that seemed relevant; it appears to be an error with many possible causes.

wardhaddadin1 commented 1 week ago

Yeah, that's the one I saw too. The int is required for scatter to aggregate the messages properly. It cannot be left as None: some batches might not have messages to all nodes (the aggregated tensor should have zeros for those nodes), and leaving dim_size=None will "squash" those zeros.
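Roughly, in plain torch (illustrative, not the actual scatter call):

import torch

# 3 messages aimed at nodes 0 and 2 of a 4-node graph.
messages = torch.tensor([1.0, 2.0, 3.0])
index = torch.tensor([0, 0, 2])

# With an explicit size, nodes 1 and 3 keep their zero rows.
out = torch.zeros(4).index_add_(0, index, messages)
# tensor([3., 0., 3., 0.])

# Sizing the output from index.max() + 1 (what dim_size=None amounts to)
# drops node 3 entirely, misaligning the result with the node tensor.
out_squashed = torch.zeros(int(index.max()) + 1).index_add_(0, index, messages)
# tensor([3., 0., 3.])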

Another restriction is that the batch_dict must be of type Dict[str, torch.Tensor], since TorchScript requires variables to have static types (and does not support mixed types like Unions).

So unless we figure out how to cast this tensor to an int, it seems like TorchScript and CUDA graphs are incompatible?

peastman commented 1 week ago

Does dim_size correspond to the batch size? Here is how TorchMD-Net handles that problem:

is_capturing = x.is_cuda and is_current_stream_capturing()
if not x.is_cuda or not is_capturing:
    self.dim_size = int(batch.max().item() + 1)
if is_capturing:
    assert (
        self.dim_size > 0
    ), "Warming up is needed before capturing the model into a CUDA graph"
    warn(
        "CUDA graph capture will lock the batch to the current number of samples ({}). Changing this will result in a crash".format(
            self.dim_size
        )
    )

It requires that the batch size be constant across all batches. It retrieves the batch size during the warmup steps, then uses that cached value during graph capture instead of trying to look it up again. Could PhysicsML do the same thing?
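Adapted to PhysicsML it might look something like this (hypothetical sketch; "num_graphs" matches the batch dict key above, but the class and attribute names are made up):

import torch
from torch.cuda import is_current_stream_capturing

# Hypothetical sketch: resolve num_graphs eagerly during warmup, reuse the
# cached int under graph capture so no tensor-to-int sync is needed.
class NumGraphsCache(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.cached_num_graphs: int = -1

    def forward(self, num_graphs: torch.Tensor) -> int:
        if not (num_graphs.is_cuda and is_current_stream_capturing()):
            self.cached_num_graphs = int(num_graphs)  # host sync is safe here
        assert self.cached_num_graphs > 0, "Warm up before capturing"
        return self.cached_num_graphs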

It would also be fine with me if this only supported CUDA graphs during inference, not training, and only for working with a single fixed molecule. My main concern is speeding up simulations, not training.

wardhaddadin1 commented 1 week ago

It doesn't necessarily correspond to the batch size (again, because the batch tensor might not include the nodes of all the graphs). But we might be able to do something similar.

I also looked at the previous error you got and I think I might know what is happening. Here is a minimal failing example:

from typing import Any, Dict

import torch
import torch.nn as nn

class PooledEGNNModule(torch.nn.Module):
    """
    Class for pooled egnn model
    """

    def __init__(self, **kwargs: Any) -> None:
        super().__init__()

        self.egnn = torch.nn.Linear(5, 12)

    def forward(
        self,
        data: Dict[str, torch.Tensor],
    ) -> torch.Tensor:

        data["node_feats"] = self.egnn(data["node_attrs"])
        output = data["node_feats"]
        return output

model = PooledEGNNModule()

batch_dict = {
    'node_attrs': torch.tensor([[0., 1., 0., 0., 0.],
            [1., 0., 0., 0., 0.],
            [1., 0., 0., 0., 0.],
            [1., 0., 0., 0., 0.],
            [1., 0., 0., 0., 0.]]),
}

# move module and batch to cuda
model.to("cuda")
for k, v in batch_dict.items():
    batch_dict[k] = v.to("cuda")

# check inference
print(model(batch_dict))

# make cuda graph
mod = torch.cuda.make_graphed_callables(model, (batch_dict,), allow_unused_input=True)

# check inference
print(mod(batch_dict))

But if you try

class PooledEGNNModule(torch.nn.Module):
    """
    Class for pooled egnn model
    """

    def __init__(self, **kwargs: Any) -> None:
        super().__init__()

        self.egnn = torch.nn.Linear(5, 12)

    def forward(
        self,
        data: Dict[str, torch.Tensor],
    ) -> torch.Tensor:

        new_data = {}
        for k, v in data.items():
            new_data[k] = v

        new_data["node_feats"] = self.egnn(data["node_attrs"])
        output = new_data["node_feats"]

        return output

then it works. It seems that CUDA graphs don't like mutating input dictionaries (or input objects in general, I guess). We would have to go through all the places where we add tensors to dicts and modify them in this way (not sure how much overhead that will add).

peastman commented 1 week ago

Possibly what it objects to is that you're modifying an input argument? It wants Python objects passed as arguments to be immutable, but it's fine with creating and modifying new objects within the method.

This doesn't seem like it should add significant overhead. You aren't copying the tensors, just putting references to the existing ones in a new dict. You could write it as new_data = copy(data): deepcopy() would be expensive in this case, but a shallow copy() should be cheap.
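For the example above, the forward would become something like (sketch):

def forward(self, data: Dict[str, torch.Tensor]) -> torch.Tensor:
    # Shallow copy: duplicates the dict's references, not the tensors.
    new_data = dict(data)
    new_data["node_feats"] = self.egnn(data["node_attrs"])
    return new_data["node_feats"]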