Hey @minsikseo-cdl,
Would it be possible for you to create a reproducible script using a simple PyG model, with Lightning as the trainer?
cc @rusty1s
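Something along these lines would do (a minimal sketch only: the random graphs stand in for the real dataset, and `LitGCN`/`make_graph` are hypothetical names, not from this issue):

```python
import torch
import pytorch_lightning as pl
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader on older PyG
from torch_geometric.nn import GCNConv


class LitGCN(pl.LightningModule):
    def __init__(self, in_channels=16, hidden=8):
        super().__init__()
        self.conv = GCNConv(in_channels, hidden)

    def training_step(self, data, batch_idx):
        out = self.conv(data.x, data.edge_index)
        return out.pow(2).mean()  # dummy loss, just to drive backward()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_graph(num_nodes=32, num_edges=128, in_channels=16):
    # Random graph standing in for the confidential data.
    edge_index = torch.randint(0, num_nodes, (2, num_edges))
    return Data(x=torch.randn(num_nodes, in_channels), edge_index=edge_index)


if __name__ == '__main__':
    loader = DataLoader([make_graph() for _ in range(8)], batch_size=2)
    pl.Trainer(max_epochs=1, gpus=1).fit(LitGCN(), loader)
```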
@tchaton uh... one moment, please. The dataset itself is confidential (it's an industry-academic project), so I'm afraid I can't provide it. But here is the part of my code where the problem occurred:
```python
import torch
from torch import nn
from torch_geometric import nn as gnn
from torch_geometric.utils import remove_self_loops, add_self_loops, sort_edge_index
from torch_geometric.nn.pool.topk_pool import topk, filter_adj
from torch_scatter import scatter
from torch_sparse import spspmm


def augment_adj(edge_index, edge_weight, num_nodes):
    # Augment connectivity with two-hop edges: add self-loops, then A @ A via spspmm.
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    edge_index, edge_weight = add_self_loops(edge_index, edge_weight,
                                             num_nodes=num_nodes)
    edge_index, edge_weight = sort_edge_index(edge_index, edge_weight,
                                              num_nodes)
    edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index,
                                     edge_weight, num_nodes, num_nodes,
                                     num_nodes)
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    return edge_index, edge_weight


class TopKPooling(nn.Module):
    def __init__(self, in_channels, ratio=0.5, act='Tanh', param={}, **kwargs):
        super().__init__()
        self.in_channels = in_channels
        self.ratio = ratio
        self.act = getattr(nn, act)(**param)
        self.p = nn.Parameter(torch.Tensor(in_channels))
        self.reset_parameters()

    def reset_parameters(self):
        size = self.in_channels
        gnn.inits.uniform(size, self.p)

    def forward(self, x, edge_index, edge_attr=None, batch=None):
        if batch is None:
            batch = edge_index.new_zeros(x.size(0))
        # Projection scores y = x . p / ||p||, then keep the top-k nodes per graph.
        y = torch.matmul(x, self.p) / self.p.norm(p=2, dim=-1)
        perm = topk(y, self.ratio, batch)
        x = x[perm] * self.act(y[perm].view(-1, 1))
        batch = batch[perm]
        edge_index, edge_attr = filter_adj(
            edge_index, edge_attr, perm, num_nodes=y.size(0))
        return x, edge_index, edge_attr, batch, perm, y[perm]


class GCNBlock(nn.Module):
    def __init__(self,
                 in_channels, out_channels, norm=True, GNN='GCNConv',
                 act='LeakyReLU', param={'negative_slope': 0.2, 'inplace': True},
                 **kwargs):
        super().__init__()
        self.norm = norm
        self.conv = getattr(gnn, GNN)(in_channels, out_channels, bias=not norm, **kwargs)
        if norm:
            self.bn = nn.BatchNorm1d(out_channels)
        self.act = getattr(nn, act)(**param)

    def forward(self, x, edge_index, edge_weight=None):
        x = self.conv(x, edge_index, edge_weight)
        if self.norm:
            x = self.bn(x)
        return self.act(x)


class PoolNet(nn.Module):
    def __init__(self,
                 in_channels, out_channels,
                 num_layers=3, GNN_param={},
                 ratio=0.5, pool_param={},
                 **kwargs):
        super().__init__()
        self.num_layers = num_layers
        # Top-K Pooling (Gao & Ji, 2019)
        self.pool = TopKPooling(
            in_channels, ratio, **pool_param)
        # define GNN layers (applied after pooling in forward)
        self.conv = nn.ModuleList()
        self.conv.append(GCNBlock(
            in_channels, out_channels, **GNN_param))
        for _ in range(num_layers - 1):
            self.conv.append(GCNBlock(
                out_channels, out_channels,
                **GNN_param))

    def forward(self, x, edge_index, batch=None):
        if batch is None:
            # default: treat all nodes as one graph
            batch = edge_index.new_zeros(x.size(0))
        edge_weight = x.new_ones(edge_index.size(1))
        # Augmentation
        edge_index, _ = augment_adj(edge_index, edge_weight, x.size(0))
        # Pooling
        out, edge_index, _, batch, _, _ = \
            self.pool(x, edge_index, None, batch=batch)
        # Convolution
        for layer in self.conv:
            out = layer(out, edge_index)
        return out, edge_index, batch
```
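For context, `augment_adj` augments the graph with two-hop connectivity: it adds self-loops and then squares the sparse adjacency via `spspmm`. A toy invocation on a hypothetical 3-node path graph would be:

```python
import torch

# Hypothetical path graph 0 - 1 - 2 (both edge directions).
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
edge_weight = torch.ones(edge_index.size(1))

# After self-loops and A @ A, nodes 0 and 2 become connected (two hops apart).
edge_index, edge_weight = augment_adj(edge_index, edge_weight, num_nodes=3)
```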
The error occurred in `augment_adj`, called from `PoolNet.forward`. My `LightningModule` consists of `PoolNet`, and the remaining parts are as usual. Sorry again that I cannot provide more detail 😥
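For reference, the wrapper is shaped roughly like this (a sketch only: the loss and channel sizes are placeholders, since the real details are confidential):

```python
import torch
import pytorch_lightning as pl


class LitPoolNet(pl.LightningModule):
    """Rough shape of the LightningModule; dims and loss are placeholders."""

    def __init__(self, in_channels=16, out_channels=32):
        super().__init__()
        self.net = PoolNet(in_channels, out_channels)

    def training_step(self, data, batch_idx):
        out, edge_index, batch = self.net(data.x, data.edge_index, data.batch)
        loss = out.pow(2).mean()  # placeholder objective
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```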
Dear @minsikseo-cdl,
Would you mind mocking the data to ensure the code is reproducible?
@tchaton
I just synthesized a dataset and script as you requested, but it worked... (which is weird).
I think this issue arises only in certain cases in the middle of the process (e.g., an OOM that `pytorch_lightning` may not catch, or something else).
In fact, I found another issue related to `torch_sparse` (#191) for a specific data or batch combination; I think it might be the cause of the CUDA error mentioned above.
By the way, I attach the code and data I used: issue.zip (but it does not reproduce the CUDA error).
I sincerely appreciate your work developing and maintaining this package.
I think this issue is more related to `pytorch_sparse` than to `pytorch_lightning`. I suggest closing this issue and continuing the discussion in https://github.com/rusty1s/pytorch_sparse/issues/191.
@rusty1s @tchaton I think so too. Or it might be related to an OOM in my case only. Thanks :)
🐛 Bug
Hi, I'm struggling with a CUDA error 😥
The following error arises: `CUDA error: an illegal memory access was encountered`. Specifically, it occurs while using `spspmm` in my model, which is composed of a few `torch_geometric` layers. I'm afraid the whole architecture is too complex to describe here, but the error occurred at `C = matmul(A, B)` inside `spspmm`.

I've tried (1) no `strategy`, (2) `strategy='dp'`, and (3) `strategy=DDPPlugin()` for the `Trainer`'s argument.

Also, my model works when I run it on CPU and GPU manually.
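For example, something along these lines runs without the error (a sketch with made-up sizes; `PoolNet` is the module shown above):

```python
import torch

model = PoolNet(in_channels=16, out_channels=32).to('cuda:0')  # .cuda(0) behaves the same
x = torch.randn(100, 16, device='cuda:0')
edge_index = torch.randint(0, 100, (2, 400), device='cuda:0')
out, edge_index, batch = model(x, edge_index)
```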
In this case, both `.to('cuda:0')` and `.cuda(0)` work fine.

Environment
pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta