Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

CUDA error: an illegal memory access was encountered while using PyG and torch_sparse #11302

Closed: minsikseo-cdl closed this issue 2 years ago

minsikseo-cdl commented 2 years ago

🐛 Bug

Hi, I'm struggling with a CUDA error 😥

The following code raises the error CUDA error: an illegal memory access was encountered.

from datetime import timedelta

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger

model = Model(**input_args)

trainer = Trainer(
    num_sanity_val_steps=0,
    gpus=1, auto_select_gpus=True,
    strategy='dp',
    max_time=timedelta(hours=1),
    logger=TensorBoardLogger(
        default_hp_metric=False,
        save_dir='/workspace/logs_v2.0',
        name=f'hood_glift'),
    callbacks=[
        LearningRateMonitor(logging_interval='epoch'),
    ]
)

trainer.fit(model, loader)

The error occurs specifically while using spspmm in my model. The model is composed of a few torch_geometric layers; I'm afraid it is too complex to describe the whole architecture here, but the error is raised at C = matmul(A, B) inside spspmm.
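For context, the spspmm call signature involved looks like this in isolation (a toy sketch, not from my model; the indices and sizes are made up):

import torch
from torch_sparse import spspmm

device = torch.device('cuda:0')
num_nodes = 4

# A tiny diagonal sparse matrix A, so C = A @ A is trivially well-defined.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [0, 1, 2, 3]], device=device)
edge_weight = torch.ones(edge_index.size(1), device=device)

# spspmm multiplies an (m x k) by a (k x n) sparse COO matrix.
out_index, out_weight = spspmm(edge_index, edge_weight,
                               edge_index, edge_weight,
                               num_nodes, num_nodes, num_nodes)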

I've tried (1) no strategy, (2) strategy='dp', and (3) strategy=DDPPlugin() as the Trainer's strategy argument.
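Concretely, the three configurations were along these lines (a minimal sketch; in this Lightning version DDPPlugin lives in pytorch_lightning.plugins):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# (1) default: no strategy argument
trainer = Trainer(gpus=1)

# (2) DataParallel
trainer = Trainer(gpus=1, strategy='dp')

# (3) an explicit DDP plugin instance
trainer = Trainer(gpus=1, strategy=DDPPlugin())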

Also, my model works when I run it manually on CPU and GPU, as in:

model = Model(**input_args)
model.to('cuda:0')
data = next(iter(loader)).to('cuda:0')
out = model(data)

In this case, both .to('cuda:0') and .cuda(0) work fine.

Environment

cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta

tchaton commented 2 years ago

Hey @minsikseo-cdl,

Would it be possible for you to create a reproducible script using a simple model from PyG, with Lightning as the trainer?
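For example, something along these lines would already help (a hypothetical skeleton with random data; names and shapes are illustrative):

import torch
import pytorch_lightning as pl
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv

class LitGCN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.conv = GCNConv(16, 8)

    def training_step(self, data, batch_idx):
        out = self.conv(data.x, data.edge_index)
        return out.pow(2).mean()  # dummy loss, just to run end to end

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# One random graph with 100 nodes and 400 edges stands in for the real data.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
loader = DataLoader([Data(x=x, edge_index=edge_index)], batch_size=1)

pl.Trainer(gpus=1, max_epochs=1).fit(LitGCN(), loader)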

cc @rusty1s

minsikseo-cdl commented 2 years ago

@tchaton One moment, please. The dataset itself is confidential (it's an industry-academic project), so I'm afraid I can't provide it. But here is the part of my code where the problem occurred:

import torch
from torch import nn
from torch_geometric import nn as gnn
from torch_geometric.utils import remove_self_loops, add_self_loops, sort_edge_index
from torch_geometric.nn.pool.topk_pool import topk, filter_adj
from torch_scatter import scatter
from torch_sparse import spspmm

# Augment connectivity with two-hop edges: A' = A @ A via sparse-sparse matrix
# multiplication (spspmm), after adding self-loops, as in Gao & Ji (2019).
def augment_adj(edge_index, edge_weight, num_nodes):
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    edge_index, edge_weight = add_self_loops(edge_index, edge_weight,
                                                num_nodes=num_nodes)
    edge_index, edge_weight = sort_edge_index(edge_index, edge_weight,
                                                num_nodes)
    edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index,
                                        edge_weight, num_nodes, num_nodes,
                                        num_nodes)
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    return edge_index, edge_weight

class TopKPooling(nn.Module):
    def __init__(self, in_channels, ratio=0.5, act='Tanh', param={}, **kwargs):
        super().__init__()
        self.in_channels = in_channels
        self.ratio = ratio
        self.act = getattr(nn, act)(**param)

        self.p = nn.Parameter(torch.Tensor(in_channels))

        self.reset_parameters()

    def reset_parameters(self):
        size = self.in_channels
        gnn.inits.uniform(size, self.p)

    def forward(self, x, edge_index, edge_attr=None, batch=None):
        if batch is None:
            batch = edge_index.new_zeros(x.size(0))

        y = torch.matmul(x, self.p) / self.p.norm(p=2, dim=-1)

        perm = topk(y, self.ratio, batch)
        x = x[perm] * self.act(y[perm].view(-1, 1))

        batch = batch[perm]
        edge_index, edge_attr = filter_adj(
            edge_index, edge_attr, perm, num_nodes=y.size(0))

        return x, edge_index, edge_attr, batch, perm, y[perm]

class GCNBlock(nn.Module):
    def __init__(self,
                 in_channels, out_channels, norm=True, GNN='GCNConv',
                 act='LeakyReLU', param={'negative_slope': 0.2, 'inplace': True},
                 **kwargs):
        super().__init__()
        self.norm = norm
        self.conv = getattr(gnn, GNN)(in_channels, out_channels, bias=not norm, **kwargs)
        if norm:
            self.bn = nn.BatchNorm1d(out_channels)
        self.act = getattr(nn, act)(**param)

    def forward(self, x, edge_index, edge_weight=None):
        x = self.conv(x, edge_index, edge_weight)
        if self.norm:
            x = self.bn(x)
        return self.act(x)

class PoolNet(nn.Module):
    def __init__(self,
                 in_channels, out_channels,
                 num_layers=3, GNN_param={},
                 ratio=0.5, pool_param={},
                 **kwargs):
        super().__init__()

        self.num_layers = num_layers

        # Top K Pooling (Gao & Ji, 2019)
        self.pool = TopKPooling(
            in_channels, ratio, **pool_param)

        # define GNN before pooling
        self.conv = nn.ModuleList()
        self.conv.append(GCNBlock(
            in_channels, out_channels, **GNN_param))
        for _ in range(num_layers - 1):
            self.conv.append(GCNBlock(
                out_channels, out_channels,
                **GNN_param))

    def forward(self, x, edge_index, batch=None):
        if batch is None:
            # NOTE: edge_attr is not defined in this scope; derive the
            # all-zeros batch vector from edge_index instead.
            batch = edge_index.new_zeros(x.size(0))
        edge_weight = x.new_ones(edge_index.size(1))

        # Augmentation
        edge_index, _ = augment_adj(edge_index, edge_weight, x.size(0))

        # Pooling
        out, edge_index, _, batch, _, _ = \
            self.pool(x, edge_index, None, batch=batch)

        # Convolution
        for layer in self.conv:
            out = layer(out, edge_index)

        return out, edge_index, batch

The error occurred in augment_adj inside PoolNet.forward. My LightningModule consists of PoolNet, and the rest of the model is standard.
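To isolate the failing op, a standalone check of augment_adj on the GPU can be run outside the Trainer; this is only a sketch, with random data standing in for ours:

import torch

device = torch.device('cuda:0')
num_nodes = 1000

# Random connectivity stands in for the confidential dataset.
edge_index = torch.randint(0, num_nodes, (2, 5000), device=device)
edge_weight = torch.ones(edge_index.size(1), device=device)

edge_index, edge_weight = augment_adj(edge_index, edge_weight, num_nodes)
print(edge_index.size(), edge_weight.size())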

Again, I'm sorry that I cannot provide more detail 😥

tchaton commented 2 years ago

Dear @minsikseo-cdl,

Would you mind mocking the data so that the code is reproducible?

minsikseo-cdl commented 2 years ago

@tchaton I just synthesized a dataset and a script as you requested, but it worked... (which is weird). I think this issue arises only in certain cases in the middle of the process (e.g., an OOM that pytorch_lightning may not catch, or something else). In fact, I found a related torch_sparse issue, #191, which occurs for a specific data or batch combination; I think it might be the cause of the CUDA error mentioned above.
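As a general debugging note (not specific to Lightning or torch_sparse): CUDA errors like this are reported asynchronously, so forcing synchronous kernel launches usually makes the Python traceback point at the kernel that actually faulted:

import os

# Must be set before CUDA is initialized (i.e., before any GPU work), or
# exported on the command line: CUDA_LAUNCH_BLOCKING=1 python train.py
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'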

By the way, I attach the code and data I made: issue.zip (but it does not reproduce the CUDA error).

I sincerely appreciate your work developing and maintaining this package.

rusty1s commented 2 years ago

I think this issue is more related to pytorch_sparse than to pytorch_lightning. I suggest closing this issue and continuing the discussion in https://github.com/rusty1s/pytorch_sparse/issues/191.

minsikseo-cdl commented 2 years ago

@rusty1s @tchaton I think so too. Or it might be related to an OOM in my case only. Thanks :)