LiyaoTang / contrastBoundary

Contrastive Boundary Learning for Point Cloud Segmentation (CVPR2022)
MIT License

(unhandled cuda error, NCCL version 2.8.2) Error occurs. #28

Closed daehyeonjeon closed 1 year ago

daehyeonjeon commented 1 year ago

/tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [102,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [103,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [104,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [105,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [106,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [107,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [108,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [109,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [110,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. 
/tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [111,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [112,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [113,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [55,0,0], thread: [114,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [32,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [33,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [34,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [35,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [36,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. 
/tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [37,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [38,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [39,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [40,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [41,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [137,0,0], thread: [42,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [35,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [36,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [37,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. 
/tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [38,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [39,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [40,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [41,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [42,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [43,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [44,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [45,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [46,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. 
/tmp/pip-req-build-vhjlnmnc/aten/src/ATen/native/cuda/IndexKernel.cu:135: operator(): block: [136,0,0], thread: [47,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.2
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 503, in <module>
    main()
  File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 127, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 247, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 166, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/workplace/dhjeon/contrastBoundary-master/pytorch/exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 263, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, criterion2, optimizer, epoch)
  File "/root/workplace/dhjeon/contrastBoundary-master/pytorch/exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 326, in train
    loss = criterion(output, target, stage_list)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/workplace/dhjeon/contrastBoundary-master/pytorch/model/pointtransformer_seg.py", line 24, in forward
    loss_list += self.contrast_head(output, target, stage_list)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/workplace/dhjeon/contrastBoundary-master/pytorch/model/heads.py", line 251, in forward
    loss = self.main_contrast(n, i, stage_list, target)
  File "/root/workplace/dhjeon/contrastBoundary-master/pytorch/model/heads.py", line 222, in point_contrast
    if not torch.any(point_mask):
RuntimeError: CUDA error: device-side assert triggered

/opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 56 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 56 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 56 leaked semaphores to clean up at shutdown
  len(cache))

The error above occurs in the contrastive boundary loss. I have confirmed that the code runs correctly with 'origin_4gpu.yaml'. Could you suggest a solution?
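
For anyone hitting the same assertion: CUDA kernels run asynchronously, so the Python line named in the traceback (here `torch.any(point_mask)`) is not necessarily the operation that indexed out of bounds. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes launches synchronous, which usually moves the error to the real call site. The sketch below also shows a plain bounds check that mirrors the failing assertion. The helper name `out_of_range_positions` and the sample data are hypothetical and not part of this repo; with a tensor, one would first convert it with something like `idx.flatten().tolist()`.

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# operation that actually failed. This must be set before CUDA is initialized.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def out_of_range_positions(indices, size):
    """Return the positions whose index falls outside [-size, size).

    This is the same range the failing assertion accepts:
    index >= -sizes[i] && index < sizes[i].
    """
    return [pos for pos, idx in enumerate(indices) if not (-size <= idx < size)]


# Hypothetical data: 5 points, and one neighbor index points past the end.
neighbor_idx = [0, 3, 4, 7, -1]
bad = out_of_range_positions(neighbor_idx, size=5)
print(bad)  # [3]
```

Running a check like this on the index tensor just before it is used for indexing shows whether the indices themselves are invalid, rather than the loss code that consumes them.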

LiyaoTang commented 1 year ago

Hi, it seems that the buffer for knn-search is not large enough. Have you tried re-compiling the pointops in this repo?
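
One way to confirm that a compiled knn-search is at fault is to compare its output against a slow brute-force reference on a small batch. The following is a minimal sketch under stated assumptions: `knn_reference` and `all_indices_valid` are hypothetical helpers written in plain Python, not the pointops API, and the points are made-up sample data.

```python
def knn_reference(points, k):
    """Brute-force k nearest neighbors for every point (includes the point itself)."""
    result = []
    for p in points:
        # Squared Euclidean distance from p to every point in the set.
        dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
        # Stable sort: ties are broken by the lower index.
        order = sorted(range(len(points)), key=lambda j: dists[j])
        result.append(order[:k])
    return result


def all_indices_valid(neighbors, num_points):
    """True if every neighbor index lies in [0, num_points)."""
    return all(0 <= j < num_points for row in neighbors for j in row)


# Hypothetical sample: four 3D points, two neighbors each.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
reference = knn_reference(points, k=2)
print(reference)  # [[0, 1], [1, 0], [2, 0], [3, 2]]
print(all_indices_valid(reference, len(points)))  # True
```

If the compiled routine returns indices that fail `all_indices_valid` on the same input, the extension build is the likely culprit and rebuilding it is a reasonable next step.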

daehyeonjeon commented 1 year ago

Hello, it seems the buffer for knn-search is not sufficient. Have you tried recompiling pointops in this repository?

I guess my previously compiled build was wrong. Thank you for your reply.