Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #6867

Closed Mgithus closed 11 months ago

Mgithus commented 1 year ago

Data information: The dataset info given in the Colab notebook for this code on the MONAI website (https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_brats21_segmentation_3d.ipynb) is as follows:

Modality: MRI
Size: 1470 3D volumes (1251 Training + 219 Validation)
Each of the 1251 training samples has 4 3D modalities and 1 3D segmentation mask (1251*5 = 6255 total images).
image shape: (240, 240, 155)
label shape: (240, 240, 155)

Code information: I am trying to run the following code from the MONAI website without any modifications: https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_brats21_segmentation_3d.ipynb

Error:

Epoch 0/4 569/1001 loss: nan time 0.83s
Epoch 0/4 570/1001 loss: nan time 4.15s
Traceback (most recent call last):
  File "notebook_of_swin_unetr.py", line 429, in <module>
    ) = trainer(
  File "notebook_of_swin_unetr.py", line 346, in trainer
    train_loss = train_epoch(
  File "notebook_of_swin_unetr.py", line 261, in train_epoch
    loss.backward()
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/tensor.py", line 214, in backward
    return handle_torch_function(
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/overrides.py", line 1060, in handle_torch_function
    result = overloaded_arg.__torch_function__(public_api, types, args, kwargs)
  File "/home/dlrs/.local/lib/python3.8/site-packages/monai/data/meta_tensor.py", line 249, in __torch_function__
    ret = super().__torch_function__(func, types, args, kwargs)
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/tensor.py", line 995, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 99, in backward
    torch.autograd.backward(outputs, args)
  File "/home/dlrs/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

(aug10) dlrs@spml3:~/Desktop/jul_25$ python -c 'import monai; monai.config.print_debug_info()'
"sox" backend is being deprecated.
The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.

Printing MONAI config...

MONAI version: 1.0.0
Numpy version: 1.21.6
Pytorch version: 1.7.1+cu110
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 170093375ce29267e45681fcec09dfa856e1d7e7
MONAI __file__: /home/dlrs/.local/lib/python3.8/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.8
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 10.0.0
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.8.2+cu110
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.5
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.31.0
mlflow version: 2.5.0
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...

System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.17
Processor: x86_64
Machine: x86_64
Python version: 3.8.17
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/dlrs/.anaconda/navigator/Code/logs/20230814T112632/ptyhost.log', fd=39, position=0, mode='a', flags=33793), popenfile(path='/snap/code/137/usr/share/code/resources/app/node_modules.asar', fd=41, position=64064, mode='r', flags=32768), popenfile(path='/snap/code/137/usr/share/code/v8_context_snapshot.bin', fd=103, position=0, mode='r', flags=32768)]
Num physical CPUs: 4
Num logical CPUs: 4
Num usable CPUs: 4
CPU usage (%): [36.9, 40.8, 42.9, 49.4]
CPU freq. (MHz): 1994
Load avg. in last 1, 5, 15 mins (%): [59.2, 67.2, 75.0]
Disk usage (%): 45.8
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 15.6
Available memory (GB): 9.3
Used memory (GB): 5.7

================================
Printing GPU config...

Num GPUs: 1
Has CUDA: True
CUDA version: 11.0
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80']
GPU 0 Name: NVIDIA GeForce GTX 1080 Ti
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 28
GPU 0 Total memory (GB): 10.9
GPU 0 CUDA capability (maj.min): 6.1

KumoLiu commented 1 year ago

Hi @Mgithus, I tried this tutorial with the MONAI v1.2 image and couldn't reproduce the error. Could you please try it with the latest stable version? Thanks!

Mgithus commented 1 year ago

Thanks @KumoLiu, I tried that but was not able to resolve it.

KumoLiu commented 11 months ago

Hope this helps: https://discuss.pytorch.org/t/summarize-the-reasons-for-the-common-error-illegal-memory-access/130406. Moving this to Discussions; feel free to open another issue.
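
For anyone hitting the same trace: the log above shows `loss: nan` for at least two steps before the CUDA error, so one way to narrow things down might be to stop at the first non-finite loss instead of calling `loss.backward()` on it. The sketch below is only an illustration, not part of the tutorial. `check_finite` is a hypothetical helper name, and the commented usage line is an assumption about how it would slot into the notebook's `train_epoch`. `CUDA_LAUNCH_BLOCKING=1` is the standard CUDA environment variable that makes kernel launches synchronous, so the traceback points closer to the failing call; it has to be set before CUDA is initialised.

```python
import math
import os

# Make CUDA kernel launches synchronous so errors surface at the call
# that caused them. Must be set before the first CUDA call in the process.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_finite(loss_value, step):
    """Hypothetical helper: raise as soon as the loss is NaN or inf.

    loss_value is a plain Python float (e.g. loss.item()),
    step is the current iteration index, used only in the message.
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at step {step}; "
            "inspect the batch and model outputs before calling backward()"
        )
    return loss_value


# Assumed placement inside the training loop, before loss.backward():
#     check_finite(loss.item(), idx)
```

Failing early like this does not fix the underlying problem, but it separates "the loss became NaN" from "the backward pass crashed", which are reported together in the log above.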