chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

RuntimeError: CUDA error: an illegal memory access was encountered #12

Closed hhaAndroid closed 3 years ago

hhaAndroid commented 3 years ago

When I train on a cluster machine with a 1080 Ti or XP (CUDA 9.0, PyTorch 1.5), the above error appears, but not on a V100 (CUDA 10.1, PyTorch 1.6) or on my local machine (CUDA 10.2, PyTorch 1.5). Do you know the reason?

    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: an illegal memory access was encountered                                   
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f8e89c2c046 in /mnt/lustre/share/spring/conda_envs/miniconda3/envs/r0.3.3/lib/python3.6/site-packa
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xa0a (0x7f8e89e6674a in /mnt/lustre/share/spring/conda_envs/miniconda3/envs/r0.3.3/lib/python3.6/site-packages/tor
frame #2: c10::TensorImpl::release_resources() + 0xb6 (0x7f8e89c1a786 in /mnt/lustre/share/spring/conda_envs/miniconda3/envs/r0.3.3/lib/python3.6/site-packages/torch/lib/libc10.s
frame #3: <unknown function> + 0x56d420 (0x7f8ecafd7420 in /mnt/lustre/share/spring/conda_envs/miniconda3/envs/r0.3.3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x56d4e1 (0x7f8ecafd74e1 in /mnt/lustre/share/spring/conda_envs/miniconda3/envs/r0.3.3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

I double-checked; the error is caused by the following line:

gt_classes[src_idx] = target_classes_o

When I modify it to the following code, the error no longer occurs:

# Move everything to the CPU, do the assignment there, then move the result back.
gt_classes = gt_classes.cpu()
src_idx = src_idx.cpu()
target_classes_o = target_classes_o.cpu()
gt_classes[src_idx] = target_classes_o
gt_classes = gt_classes.to(pred_class_logits.device)

But I don't know why. Looking forward to your reply.
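A side note on debugging this kind of failure: CUDA kernels run asynchronously, so the line that reports the illegal access is not necessarily the line that caused it, and on older PyTorch/CUDA builds an out-of-range index in advanced indexing can surface as an illegal memory access instead of the clear IndexError the CPU path would raise. Below is a minimal sketch of a sanity check that could be dropped in right before the assignment; the helper name checked_scatter is made up, and it assumes src_idx is a single 1-D long tensor indexing the first dimension of gt_classes, as the workaround above suggests.

import torch

def checked_scatter(gt_classes, src_idx, target_classes_o):
    # Hypothetical helper: performs `gt_classes[src_idx] = target_classes_o`
    # with extra checks to help localize asynchronous CUDA failures.
    if gt_classes.is_cuda:
        # Flush pending kernels so an error from an earlier op surfaces here,
        # not at this (possibly innocent) line.
        torch.cuda.synchronize(gt_classes.device)

    # Advanced indexing expects long indices on the same device as the target.
    assert src_idx.dtype == torch.long, src_idx.dtype
    assert src_idx.device == gt_classes.device

    if src_idx.numel() > 0:
        # On the CPU an out-of-range index raises IndexError; on older
        # PyTorch/CUDA builds it can show up as an illegal memory access.
        assert 0 <= int(src_idx.min()) and int(src_idx.max()) < gt_classes.shape[0]

    gt_classes[src_idx] = target_classes_o
    if gt_classes.is_cuda:
        torch.cuda.synchronize(gt_classes.device)
    return gt_classes

If the asserts pass but the crash remains, the corruption most likely comes from an earlier kernel, which CUDA_LAUNCH_BLOCKING=1 (see below) would help pinpoint.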

chensnathan commented 3 years ago

I didn't encounter this error before, as I trained all models on 2080 Ti GPUs with CUDA 10.2. It seems that this bug is related to the CUDA version.

BTW, have you tried debugging with CUDA_LAUNCH_BLOCKING=1? Does it give the same CUDA error?
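For reference, CUDA_LAUNCH_BLOCKING has to be in the environment before CUDA is initialized. One way (a sketch, assuming you can edit the entry script) is to set it at the very top of the training script:

import os

# Make kernel launches synchronous so the reported stack trace points at the
# kernel that actually failed. This must happen before CUDA is initialized,
# so set it before importing torch or detectron2.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the env var on purpose)

Equivalently, prefix the launch command with CUDA_LAUNCH_BLOCKING=1.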

hhaAndroid commented 3 years ago

Thank you, I will try it!