NVlabs / VoxFormer

Official PyTorch implementation of VoxFormer [CVPR 2023 Highlight]

preds are nan #28

Open zhangzaibin opened 1 year ago

zhangzaibin commented 1 year ago

Thanks for your great work. I have an issue: in stage 2, my preds are NaN right at the start of training, which then causes an error. Have you ever encountered this problem? I am training with VoxFormer-T.
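If it helps anyone debugging this: the first non-finite prediction can be caught with a couple of lines of plain PyTorch before the loss blows up. A minimal sketch (the toy model and data below are placeholders, not VoxFormer code):

```python
import torch
import torch.nn as nn

# Toy stand-ins just to make the sketch runnable; replace with your own
# VoxFormer model and dataloader.
model = nn.Linear(8, 4)
loader = [torch.randn(2, 8) for _ in range(10)]

# Makes backward raise an error that points at the forward op which
# produced the NaN/Inf gradient.
torch.autograd.set_detect_anomaly(True)

for step, x in enumerate(loader):
    preds = model(x)
    if not torch.isfinite(preds).all():   # catch NaN/Inf as early as possible
        print(f"non-finite preds first appear at step {step}")
        break
    preds.sum().backward()
```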

KSonPham commented 1 year ago

I have this problem too.

RoboticsYimingLi commented 1 year ago

Different machines can behave differently. Could you try running it a few more times?

KSonPham commented 1 year ago

Yes, for me the problem goes away when I set the number of workers to 0 (though not always) or run in a Docker environment (no errors whatsoever). Another problem is that setting a large number of workers, such as the default of 4, fills up my 32 GB of memory.
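For reference, the worker count is just the num_workers argument of torch.utils.data.DataLoader (in mmdetection-style configs it usually corresponds to workers_per_gpu). A minimal sketch with a toy dataset, not the VoxFormer config:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8))

# num_workers=0 keeps data loading in the main process: slower, but it avoids
# the per-worker memory overhead and the worker-related crashes described above.
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=True)

for (x,) in loader:
    pass  # training step would go here
```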

ziming-liu commented 1 year ago

Is it a CUDA memory error? what(): CUDA error: an illegal memory access was encountered

willemeng commented 1 year ago

I also ran into a similar problem:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbabb853a22 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7fbac4010aa3 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fbac4012147 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fbabb83d5a4 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7fb952a2822a in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7fb952a282c1 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x29d90 (0x7fbaeb029d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7fbaeb029e40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)

Strangely, the error does not occur when I debug remotely, but it appears as soon as I run the code from the remote server's terminal. On rare occasions it runs normally.
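As the message itself says, device-side asserts are reported asynchronously, so the stack trace above may not point at the real call. Setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialised makes kernels launch synchronously, so the failing one surfaces at the line that launched it. A minimal sketch of setting it from Python (you can equally prefix the training command with CUDA_LAUNCH_BLOCKING=1 in the shell):

```python
import os

# Must be set before CUDA is initialised (i.e. before the first CUDA op,
# ideally before importing anything that touches the GPU), otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

if torch.cuda.is_available():
    x = torch.randn(4, device="cuda")  # kernels now run synchronously, so errors
    print(x.sum().item())              # are raised at the offending line
else:
    print("CUDA not available; nothing to demonstrate")
```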
zzk785089755 commented 10 months ago


I also encountered this issue. Deleting the ./VoxFormer/deform_attn_3d directory and re-uploading it fixed it for me. I'm curious about the reason and hope the authors can provide an explanation.
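A plausible explanation (an assumption on my part, not confirmed by the authors) is a stale compiled CUDA extension: if deform_attn_3d was built against a different PyTorch/CUDA combination than the one the job runs with, illegal-memory-access and device-side-assert errors are typical symptoms, and re-uploading the directory forces a clean rebuild. A quick sanity check of the runtime versions the rebuilt extension has to match (the import name at the end is hypothetical; use whatever module the extension actually installs):

```python
import torch

# Versions the running process will use; the compiled extension must match these.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))

# Hypothetical import name -- replace with the module the deform_attn_3d
# extension actually installs; an ImportError or ABI error here points to a stale build.
try:
    import deform_attn_3d  # noqa: F401
    print("deform_attn_3d extension imports cleanly")
except Exception as e:
    print("extension problem:", e)
```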

Kang-ChangWoo commented 1 week ago

I don't know if this will be of much help to you.

In my case, it came from the VoxFormerLayer class in projects/mmdet3d_plugin/voxformer/modules/encoder.py when I set the batch size above 1. As the input passes through multiple transformer layers and normalization, the batch normalization layer produces infinite values. So I am using batch size 1, which is not a fundamental solution.
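To locate the first layer that goes non-finite when the batch size is larger than 1, a forward hook on every submodule is enough. A minimal sketch in plain PyTorch (the small Sequential model is only a stand-in for the VoxFormer encoder):

```python
import torch
import torch.nn as nn

# Toy model as a placeholder for the real encoder.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 4))

def check_finite(name):
    def hook(module, inputs, output):
        # Report the first submodule whose output contains NaN/Inf.
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            print(f"non-finite output in layer {name} ({module.__class__.__name__})")
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(check_finite(name))

model(torch.randn(2, 8))  # batch size > 1, the setting that triggered the issue
```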