OpenRobotLab / EmbodiedScan

[CVPR 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

[Bug] CUDA error: an illegal memory access was encountered #26

Closed: yxchng closed this issue 2 months ago

yxchng commented 3 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA H100 80GB HBM3
CUDA_HOME: /fs/applications/cuda/12.1.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu121
OpenCV: 4.9.0
MMEngine: 0.10.3
MMDetection: 3.3.0
MMDetection3D: 1.4.0+
spconv2.0: False

Reproduces the problem - code sample

-

Reproduces the problem - command or script

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py

Reproduces the problem - error message

[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14a547a35d87 in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14a5479e675f in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14a547b068a8 in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x14a548bd93ac in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x14a548bdd4c8 in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x14a548be0bfa in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x14a548be1839 in /home/user/cache/conda/envs/embodiedscan/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x14a5928edbf4 in /home/user/cache/conda/envs/embodiedscan/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x14a59edab1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x14a59e28de73 in /lib64/libc.so.6)

Additional information

I sometimes run into `CUDA error: an illegal memory access was encountered`. Do you happen to know what might be causing it?

Tai-Wang commented 3 months ago

It can be caused by an OOM (out-of-memory) problem. What kind of GPU do you use? You could try reducing the batch size in the config (such as here).
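
For illustration, a batch-size reduction in an mmengine-style config is a one-key change. The sketch below assumes the usual train_dataloader layout; the exact keys and defaults in the EmbodiedScan config may differ:

```python
# Hypothetical excerpt of an mmengine-style config: lower the per-GPU batch size
# to cut peak GPU memory. Key names follow the common mmengine convention; check
# the repo's actual config for the real defaults.
train_dataloader = dict(
    batch_size=6,   # e.g. halved from 12
    num_workers=4,  # dataloader workers per GPU
)
```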

yxchng commented 3 months ago

@Tai-Wang H100 80GB. It only occurs sometimes. Does memory usage sometimes shoot up above 80GB?

Tai-Wang commented 3 months ago

OK, that is a little strange. We train the model on A100s and typically observe about 30 GB of memory usage. Still, you could try reducing the batch size first to check whether OOM is the cause.
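
If it helps to verify whether peak usage really approaches 80 GB, PyTorch's allocator statistics can be printed around training iterations. A minimal sketch using standard torch.cuda calls (the helper name is ours, not part of the codebase):

```python
import torch


def log_peak_gpu_memory(tag: str = '') -> None:
    """Print the peak GPU memory PyTorch has allocated/reserved on the current device."""
    alloc_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    reserved_gib = torch.cuda.max_memory_reserved() / 1024 ** 3
    print(f'{tag} peak allocated: {alloc_gib:.1f} GiB, peak reserved: {reserved_gib:.1f} GiB')
```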

mrsempress commented 3 months ago

When I ran it, I hit the same problem, but not until epoch 6.

04/02 05:42:05 - mmengine - INFO - Epoch(train)  [6][150/501]  base_lr: 5.0000e-04 lr: 5.0000e-04  eta: -1 day, 21:18:02  time: 3.6373  data_time: 0.4038  memory: 29048  grad_norm: 19.5125  loss: 8.7404  loss_cls: 0.9937  loss_bbox: 0.4470  d0.loss_cls: 1.0453  d0.loss_bbox: 0.4450  d1.loss_cls: 1.0229  d1.loss_bbox: 0.4460  d2.loss_cls: 1.0076  d2.loss_bbox: 0.4439  d3.loss_cls: 1.0010  d3.loss_bbox: 0.4465  d4.loss_cls: 0.9956  d4.loss_bbox: 0.4460
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 666, in forward
    return self.loss(inputs, data_samples, **kwargs)
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 507, in loss
    losses = self.bbox_head.loss(**head_inputs_dict,
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 637, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 668, in loss_by_feat
    losses_cls, losses_bbox = multi_apply(
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 711, in loss_by_feat_single
    cls_reg_targets = self.get_targets(cls_scores_list,
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 258, in get_targets
    pos_inds_list, neg_inds_list) = multi_apply(self._get_targets_single,
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 398, in _get_targets_single
    assign_result = self.assigner.assign(
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/task_modules/assigners/hungarian_assigner.py", line 119, in assign
    cost = cost.detach().cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
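
One way to act on that last hint is to force synchronous kernel launches, so the reported stack trace points at the operation that actually faulted rather than at a later sync point such as cost.detach().cpu(). A minimal sketch of this general PyTorch/CUDA debugging step (the environment variable must be set before CUDA is initialized, e.g. at the very top of tools/train.py):

```python
# Force synchronous CUDA launches; this slows training but makes the failing
# kernel appear at its real call site in the traceback.
import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```
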
yxchng commented 3 months ago

@mrsempress I also only hit this partway through training, at random. Sometimes it runs to completion without error, but the random crashes are very annoying.

Tai-Wang commented 3 months ago

Thanks for your feedback. It might be related to #29 as well. We welcome more reports of such problems from the community and will collect more cases to analyze the possible causes.

Tai-Wang commented 3 months ago

I also encounter this problem more frequently when using data infos with more complex prompts. There are several measures that may alleviate it:

  1. Reduce the batch size, e.g., from 12 to 6 or 8
  2. Add time.sleep(0.01) before cost = cost.detach().cpu() to reduce the stress on the GPU (see the sketch after this list)
  3. Reduce the num_workers

These measures may not completely avoid the problem, but they should reduce how often you encounter it.
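
For suggestion 2, a minimal sketch of the intended change; the helper below is only illustrative, since in practice the sleep goes directly before the existing cost = cost.detach().cpu() line in hungarian_assigner.py (see the traceback above):

```python
import time

import torch


def cost_to_cpu(cost: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in for the assigner's sync point (suggestion 2 above)."""
    time.sleep(0.01)            # brief pause to ease pressure on the GPU
    return cost.detach().cpu()  # the device-to-host copy where the error surfaces
```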

Tai-Wang commented 2 months ago

Another trick that I have tried is to reduce the num_queries, for example, to 100. It can also significantly reduce the burden when doing matching and computing the costs.
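
As a rough sketch of that change, assuming num_queries is exposed in the model section of the grounding config (the exact key location in the repo's config may differ):

```python
# Hypothetical excerpt: fewer decoder queries means a smaller cost matrix in the
# Hungarian matching step, easing memory pressure there.
model = dict(
    num_queries=100,  # e.g. reduced from 256
)
```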

henryzhengr commented 2 months ago

> Another trick that I have tried is to reduce the num_queries, for example, to 100. It can also significantly reduce the burden when doing matching and computing the costs.

Will it cause any performance drops?

Tai-Wang commented 2 months ago

> Another trick that I have tried is to reduce the num_queries, for example, to 100. It can also significantly reduce the burden when doing matching and computing the costs.
>
> Will it cause any performance drops?

It has limited influence on performance: my AP@0.25 increases and AP@0.5 decreases slightly with num_queries=100 and max_text_length=512 (vs. our provided baseline with num_queries=256 and max_text_length=256).