zhangzaibin opened this issue 1 year ago
I have this problem too.
Different machines show different behaviour. Could you try running it a few more times?
Yes, for me the problem goes away when I set the number of workers to 0 (though not always) or run in a Docker environment (no error whatsoever). Another problem is that a larger number of workers, such as 4 (the default), filled up my 32 GB of memory. A sketch of where the worker count is typically set is shown below.
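For reference, a minimal sketch of where the dataloader worker count usually lives in an MMDetection3D-style config; the exact keys in the VoxFormer configs may differ, so treat this as an assumption rather than a confirmed excerpt:

```python
# Hypothetical excerpt of an MMDetection3D-style config.
# workers_per_gpu=0 loads data in the main process (no worker subprocesses),
# which sidesteps the multi-worker crash and keeps host RAM usage low,
# at the cost of a slower input pipeline.
data = dict(
    samples_per_gpu=1,   # per-GPU batch size
    workers_per_gpu=0,   # dataloader worker processes per GPU
)
```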
Is it a CUDA memory error? what(): CUDA error: an illegal memory access was encountered
I also ran into a similar problem:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbabb853a22 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7fbac4010aa3 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fbac4012147 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fbabb83d5a4 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7fb952a2822a in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7fb952a282c1 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x29d90 (0x7fbaeb029d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7fbaeb029e40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
Strangely, the error never appears when I debug remotely, but it shows up as soon as I run the job from the remote server's terminal; on rare occasions it runs normally.
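As the log itself suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace points at the kernel that actually failed instead of a later, unrelated API call. A minimal sketch (the variable must be set before CUDA is initialized; exporting it in the shell before launching the training script is equivalent):

```python
import os

# Must be set before the first CUDA call (i.e. before torch.cuda is initialized),
# otherwise the CUDA runtime has already started in its default asynchronous mode.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var on purpose

# ... build the model and run the failing training step here; the device-side
# assert will now surface at the offending kernel launch.
```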
I also encountered this issue. Deleting the ./VoxFormer/deform_attn_3d directory and re-uploading it resolved the problem. I'm curious about the reason and hope the author can provide an explanation.
I don't know whether this will help, but in my case the error came from the VoxFormerLayer class in projects/mmdet3d_plugin/voxformer/modules/encoder.py when I set the batch size above 1. As the input passes through multiple transformer and normalization layers, the batch normalization layer produces infinite values. So I am using a batch size of 1, which is a workaround rather than a fundamental fix. A sketch of how to locate the offending layer follows below.
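In case it helps others narrow this down, here is a small, hedged sketch (not VoxFormer-specific; `model` stands for whatever detector you build) of a forward hook that reports the first module emitting non-finite outputs:

```python
import torch

def add_finite_check_hooks(model: torch.nn.Module) -> None:
    """Raise as soon as any submodule produces a non-finite (inf/NaN) output."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values produced by module '{name}'")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage sketch: call add_finite_check_hooks(model), then run one forward pass
# with batch size > 1; the raised error names the first offending layer.
```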
Thanks for your great work. I have an issue: in stage 2, my predictions are NaN at the start of training, which then causes an error. Have you ever encountered this problem? I am training with VoxFormer-T.