Closed by ramdrop 2 years ago
Hi, thanks for your interest.
Have you tried running the program with the debug flag, or with CUDA_LAUNCH_BLOCKING=1 as suggested?
Running with the debug flag and CUDA_LAUNCH_BLOCKING=1 gives a more detailed log:
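For anyone landing here with the same error, here is a minimal sketch of one way to set CUDA_LAUNCH_BLOCKING=1 for a single run without exporting it in the shell. The helper names `blocking_env` and `run_with_blocking` are made up for illustration and are not part of this repo. The variable makes CUDA kernel launches synchronous, so the Python traceback points at the call that actually failed.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING set to "1".

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run `cmd` (a list of strings) with synchronous CUDA launches.

    Returns the exit code of the child process.
    """
    return subprocess.call(cmd, env=blocking_env())
```

With this in place, the training command from this thread could be launched as `run_with_blocking(["bash", "tool/train.sh", "s3dis", "<experiment name>"])`. Expect the run to be noticeably slower, so it is best kept for debugging only.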
/opt/conda/conda-bld/pytorch_1623448265233/work/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [252,0,0], thread: [91,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/LOCAL2/ramdrop/apps/anaconda3/envs/pt/lib/python3.7/site-packages/torch/nn/functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)
return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 452, in <module>
main()
File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 128, in main
main_worker(args.train_gpu, args.ngpus_per_node, args)
File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 261, in main_worker
loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch)
File "exp/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1/train.py", line 323, in train
loss = criterion(output, target, stage_list)
File "/LOCAL2/ramdrop/apps/anaconda3/envs/pt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/LOCAL2/ramdrop/github/point_registration/contrastBoundary/pytorch/model/pointtransformer_seg.py", line 24, in forward
loss_list += self.contrast_head(output, target, stage_list)
File "/LOCAL2/ramdrop/apps/anaconda3/envs/pt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/LOCAL2/ramdrop/github/point_registration/contrastBoundary/pytorch/model/heads.py", line 251, in forward
loss = self.main_contrast(n, i, stage_list, target)
File "/LOCAL2/ramdrop/github/point_registration/contrastBoundary/pytorch/model/heads.py", line 191, in point_contrast
labels = get_subscene_label(n, i, stage_list, target, self.nstride, self.config.num_classes) # (m, ncls) - distribution / onehot
File "/LOCAL2/ramdrop/github/point_registration/contrastBoundary/pytorch/model/basic_operators.py", line 14, in get_subscene_label
return get_subscene_features(stage_n, stage_i, stage_list, x, nstride, **kwargs)
File "/LOCAL2/ramdrop/github/point_registration/contrastBoundary/pytorch/model/basic_operators.py", line 42, in get_subscene_features
x = x[neighbor_idx, :].view(p_to.shape[0], kr, x.shape[1]) # (m, kr, ncls)
RuntimeError: CUDA error: device-side assert triggered
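The assertion at the top of the log (`index >= -sizes[i] && index < sizes[i]`) says that some index passed to the indexing call in `get_subscene_features` is outside the tensor's first dimension. A rough way to confirm this without waiting for the device-side assert is to check the index range explicitly just before indexing. The helper below is only a sketch: `check_index_bounds` is a hypothetical name, not a function from this repo, and it assumes the index object exposes `.min()` and `.max()` (as a torch.Tensor or NumPy array does) and is non-empty.

```python
def check_index_bounds(index, size, name="index"):
    """Raise IndexError if any entry of `index` lies outside [-size, size).

    `index` is any non-empty object exposing .min() and .max();
    `size` is the length of the dimension being indexed.
    """
    lo = int(index.min())
    hi = int(index.max())
    if lo < -size or hi >= size:
        raise IndexError(
            f"{name} out of bounds: min={lo}, max={hi}, "
            f"valid range is [{-size}, {size - 1}]"
        )
```

In this traceback it could be called as `check_index_bounds(neighbor_idx, x.shape[0], "neighbor_idx")` on the line before `x = x[neighbor_idx, :]`. On a CUDA tensor the `int(...)` conversions force a synchronization, so this is a debugging aid rather than something to leave in the training loop.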
Ahh, I found the problem: I had created the conda environment and compiled pointops from the point-transformer repo. After I compiled the pointops in your repo instead, the training works.
Thanks for your quick reply!
Cheers.
Hi, thanks for open-sourcing your wonderful work. I encountered a few errors when trying to reproduce your results. Could you please help debug this issue?
Environment: Ubuntu 18.04, PyTorch 1.9.0, CUDA 11.1, A100. I set up the conda environment by following point-transformer.
Launch run:
bash tool/train.sh s3dis origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1
Config: Since I don't have 4 GPUs, I modified the gpu, batch_size and workers parameters in
pytorch/config/s3dis/origin_multi-Ua-concat-latent_contrast-Ua-softnn-latent-label-l2-w.1.yaml
accordingly.
Outputs:
full log: train-20220530_205425.log