NVlabs / SegFormer

Official PyTorch implementation of SegFormer
https://arxiv.org/abs/2105.15203

SegFormerB1 CityScapes - CUDA error: an illegal memory access was encountered #126

Open adriengoleb opened 1 year ago

adriengoleb commented 1 year ago

Hello,

I want to run the training code as follows with 1 GPU: python tools/train.py local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py --gpus 1

Firstly, I got the error AssertionError: Default process group is not initialized. Following the comments in other issues, I replaced all SyncBN with BN in the code.
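
For reference, the change amounts to switching the norm layer type in the model config. A minimal sketch of what I edited, assuming the usual layout where norm_cfg lives in the base model config (the exact file path may differ):

  # local_configs/_base_/models/segformer.py (or wherever norm_cfg is defined)
  # SyncBN requires an initialized distributed process group, which plain
  # single-GPU (non-distributed) training never creates, hence the AssertionError.
  norm_cfg = dict(type='BN', requires_grad=True)  # was: dict(type='SyncBN', requires_grad=True)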

After that, I get this error:

Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 155, in main
    train_segmentor(
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/apis/train.py", line 115, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 153, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 204, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629395347/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fa9dae8377d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fa9db0d3d9d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fa9dae6fb1d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7faa189e156b in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xf5 (0x7faa45c2a555 in /lib64/libc.so.6)

Aborted

I get the same error when I launch tools/dist_train.sh local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py 1 (i.e. training through dist_train.sh).

I always end up with this CUDA error: an illegal memory access was encountered ...
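
In case it matters for debugging: since CUDA launches kernels asynchronously, the line in the traceback (loss_value.item()) is probably just where the error surfaces, not where it actually happens. One thing I could try is forcing synchronous launches so the trace points at the real faulting op; a sketch (the variable has to be set before anything initializes CUDA):

  # at the very top of tools/train.py, before torch / mmcv are imported
  import os
  os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # make kernel launches synchronous for debugging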

hhhyyeee commented 3 months ago

Hi, the same problem is occurring here. Any progress?

zenhanghg-heng commented 3 weeks ago

Since the error surfaces at 'log_vars[loss_name] = loss_value.item()', I think it comes from wrong label indices produced in the dataset augmentation pipeline.

Check your labels (a small sanity-check sketch follows at the end of this comment):

  1. They should contain the integer ID of each class, not RGB color labels.
  2. Check the "ignore index": the cross-entropy loss defaults to ignore_index=-100, so make sure your dataset config files ignore the right index.

For a custom dataset that does not ignore the background, the padding step should write a separate index into the padded label pixels, so it does not collide with class IDs that are actually counted by your loss function.

For example, with a custom dataset a very common cause of this error is a wrong fill value for the padded label pixels; it can be fixed by changing dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) to dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100).
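
A minimal sketch of the label check I mean, with placeholder paths and class counts (it assumes labels are stored as single-channel PNG class-ID maps, as mmseg expects):

  import glob
  import numpy as np
  from PIL import Image

  # Hypothetical values -- adjust to your own dataset layout.
  label_dir = 'data/my_dataset/ann_dir/train'  # directory with label maps
  num_classes = 19                             # e.g. 19 for Cityscapes
  ignore_index = 255                           # whatever your config treats as "ignore"

  bad = set()
  for path in glob.glob(f'{label_dir}/*.png'):
      # Labels must be single-channel class-ID maps, not RGB color images.
      ids = np.unique(np.array(Image.open(path)))
      # Any value outside [0, num_classes) that is not the ignore index can
      # make the loss kernel index out of bounds -> illegal memory access.
      bad.update(int(v) for v in ids if v >= num_classes and v != ignore_index)

  print('unexpected label values:', sorted(bad) if bad else 'none')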