facebookresearch / SparseConvNet

Submanifold sparse convolutional networks
https://github.com/facebookresearch/SparseConvNet

RuntimeError: CUDA error: an illegal memory access was encountered #231

Open eddiewrc opened 2 years ago

eddiewrc commented 2 years ago

Hi, first of all thanks for sharing this library with all of us! Unfortunately I am encountering a few problems while trying to run it. In particular, I tried to build the following network, which is supposed to take as input a sparse tensor of shape (8192, 16384). Part of it is commented out because I was trying to locate the origin of the problem: apparently it happens with just the first Convolution module, so I commented out the rest for now.

The error I get is pasted below. The GPU is a Quadro GV100, the system CUDA version is 11.4, and PyTorch is 1.11.0 (py3.9_cuda11.3_cudnn8.2.0_0).

```python
import torch as t
import sparseconvnet as scn


class HCSparseConvNet1(t.nn.Module):
    def __init__(self, featSize, numOut, size, name="NN"):
        super(HCSparseConvNet1, self).__init__()
        print(size)
        # 2D input layer over a grid of spatial size `size`
        self.inputLayer = scn.InputLayer(2, size, 2)

        # Only the first Convolution is kept; the rest is commented out to
        # isolate the problem:
        #   scn.Convolution(2, 4, 8, 8, 4, True), scn.LeakyReLU(),
        #   scn.Convolution(2, 8, 16, 3, 2, True), scn.LeakyReLU(),
        #   scn.Convolution(2, 16, 16, 3, 2, True), scn.SparseToDense(2, 16),
        #   scn.MaxPooling(2, 16, 8), scn.Convolution(2, 10, 10, 64, 32, False)
        self.sparseModel = scn.Sequential(scn.Convolution(2, 1, 4, 8, 8, True))
        self.out1 = t.nn.Sequential(
            t.nn.GroupNorm(1, 16), t.nn.Tanh(), t.nn.Conv2d(16, 8, 3, 2),
            t.nn.GroupNorm(1, 8), t.nn.Tanh(), t.nn.Conv2d(8, 4, 3, 1, padding=1),
            t.nn.GroupNorm(1, 4), t.nn.Tanh())
        # self.spatial_size = self.sparseModel.input_spatial_size(size)
        self.final = t.nn.Sequential(
            t.nn.Linear(7812, 100), t.nn.LayerNorm(100), t.nn.Tanh(),
            t.nn.Linear(100, numOut))

    def forward(self, x, batchSize):
        # print(x[0].size(), x[1].size())
        x = self.inputLayer(x)
        x = self.sparseModel(x)
        print(x)
        # x = self.out1(x)
        # print(x.size())
        # x = self.final(x.view(batchSize, -1))
        return x
```
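For context, this is roughly how the model is driven; it mirrors the `[coord, features]` call visible in the traceback below, but the shapes, device and constructor arguments here are illustrative, and I am assuming the last column of `coord` holds the batch index:

```python
import torch as t

device = "cuda"
size = t.LongTensor([8192, 16384])               # spatial size of the 2D grid
model = HCSparseConvNet1(1, 2, size).to(device)  # illustrative featSize / numOut

# Toy sparse input: N active sites with one feature plane each.
# coord is (N, 3) = (x, y, batch index); features lives on the GPU.
N = 1000
coord = t.stack([t.randint(0, 8192, (N,)),
                 t.randint(0, 16384, (N,)),
                 t.zeros(N, dtype=t.long)], dim=1)
features = t.randn(N, 1, device=device)

yp = model([coord, features], 1)                 # batchSize = 1 for this toy batch
```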

The error:

```
Traceback (most recent call last):
  File "/home/eddiewrc/galiana2/galianaHCsparseConvNet.py", line 144, in <module>
    sys.exit(main(sys.argv))
  File "/home/eddiewrc/galiana2/galianaHCsparseConvNet.py", line 94, in main
    wrapper.fit(X, Y, device, epochs=50, batch_size = 11, LOG=False)
  File "/home/eddiewrc/galiana2/sources/HCModels.py", line 200, in fit
    yp = self.model.forward([coord, features], batchSize)
  File "/home/eddiewrc/galiana2/sources/HCModels.py", line 58, in forward
    print(x)
  File "/home/eddiewrc/SparseConvNet/sparseconvnet/sparseConvNetTensor.py", line 58, in __repr__
    'features=' + repr(self.features) + \
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 305, in __repr__
    return torch._tensor_str._str(self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 434, in _str
    return _str_intern(self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 409, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 264, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 296, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
```
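Since the report is asynchronous, the traceback above points at the later `print(x)` rather than at the kernel that actually faults. As the message suggests, one way to get a trace at the real call site is to force synchronous launches; a minimal sketch, assuming the variable is set before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call so that every kernel launch is
# synchronous and the error surfaces at the op that actually triggered it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  -- imported only after the variable is set
```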
eddiewrc commented 2 years ago

I have an addition to make: these are the GPU settings on my machine (3 GPUs). Apparently the error happens only when I try to use GPUs 1 and 2; the library works fine on what PyTorch recognizes as cuda:0 (which happens to be Quadro #1).

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:09:00.0 Off |                  N/A |
| 30%   52C    P2    65W / 250W |   1521MiB / 12196MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GV100        Off  | 00000000:83:00.0 Off |                  Off |
| 38%   52C    P2    40W / 250W |   3379MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro GV100        Off  | 00000000:84:00.0 Off |                  Off |
| 36%   49C    P2    40W / 250W |      8MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
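One more note: nvidia-smi lists GPUs in PCI-bus order, while the CUDA runtime (and therefore PyTorch) defaults to a "fastest first" ordering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, which would explain cuda:0 mapping to a GV100 here. A quick sketch to check the mapping on a given machine:

```python
import torch

# Print PyTorch's device index next to each device name; compare with the
# nvidia-smi table above to see how the two orderings line up.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```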
AndreGraca98 commented 2 years ago

Hello, I also had this issue but I found a workaround: if you call torch.cuda.set_device(1) before sending the model to the device with model.to('cuda:1'), it works fine :)
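In code, the workaround looks roughly like this; a sketch that assumes `model` is the network instance, with index 1 only as an example (the point is that the current CUDA device and the model's device agree):

```python
import torch

device = torch.device("cuda:1")  # example: the second GPU

# Select the GPU as the current CUDA device *before* moving the model, so the
# library's kernels are launched on the same device that holds the tensors.
torch.cuda.set_device(device)
model = model.to(device)

# The feature tensor fed to the InputLayer should live on the same device too.
```

This would suggest that some kernels are launched on the current CUDA device rather than on the device of the input tensors, so the two need to match.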