CUDA error: device-side assert triggered

Rango-Zhang-Hang commented 1 year ago

Thank you for this great work! I followed the instructions and used the full nuScenes v1.0 dataset. However, every time I run the training code (I have tried multiple times), it fails with this error at around epoch 1 [14000/20000]. I am training with the provided '.pkl' files, so I wonder whether anyone else has hit this problem. I read online that the cause is an inconsistency between the labels and the output, but the error appears partway through training, not at the very beginning, which seems weird to me.
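The label/output mismatch mentioned above usually means a class index that falls outside the range the classification head predicts. A bad sample sits at one place in the dataset, so this kind of error can surface partway through an epoch rather than at the first iteration. Below is a minimal sketch of such a check, assuming the labels are plain integers and `num_classes` matches the head's output size. The function name and the sample values are hypothetical, not taken from this repository.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return (index, label) pairs for labels outside [0, num_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < num_classes]


# Hypothetical example: 10 classes, one label equal to num_classes.
bad = find_out_of_range_labels([0, 3, 9, 10, 2], num_classes=10)
print(bad)  # [(3, 10)]
```

An empty result would suggest the labels are not the cause and the failure comes from somewhere else.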

I attached the report:

,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [10,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [11,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [12,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [13,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [14,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [15,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [16,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [17,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 275, in main
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 191, in train_model
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 159, in train_detector
    runner.run(data_loaders, cfg.workflow)

  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 294, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 312, in forward_train
    loss_det = self.bbox_head.loss(*x, gt_bboxes_3d, gt_labels_3d, img_metas)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 234, in loss
    positive_losses.append(self.positive_bag_loss(matched_cls_prob, matched_box_prob))
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 272, in positive_bag_loss
    bag_prob, torch.ones_like(bag_prob), reduction='none')
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/functional.py", line 2762, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Aborted (core dumped)
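The assertion in the log is the range check that `binary_cross_entropy` applies to its input: every value must lie in [0, 1]. A NaN fails this check too, because any comparison with NaN is false. So one possible cause, and not necessarily the actual one here, is that `bag_prob` became NaN or drifted slightly out of range partway through training. The sketch below mirrors that condition in plain Python. The helper names and the sample values are hypothetical and not taken from the run.

```python
import math


def passes_bce_range_check(p: float) -> bool:
    """Mirror of the asserted condition: input_val >= zero && input_val <= one."""
    return p >= 0.0 and p <= 1.0


def summarize(values):
    """Count values that would trip the assertion, split by cause."""
    nan = sum(1 for v in values if math.isnan(v))
    out_of_range = sum(
        1 for v in values
        if not math.isnan(v) and not passes_bce_range_check(v)
    )
    return {"nan": nan, "out_of_range": out_of_range,
            "ok": len(values) - nan - out_of_range}


# Hypothetical values standing in for bag_prob; not taken from the actual run.
sample = [0.0, 0.37, 1.0, 1.0000001, -0.2, float("nan"), float("inf")]
report = summarize(sample)
print(report)  # {'nan': 1, 'out_of_range': 3, 'ok': 3}
```

To try this on the real tensor, one option would be to pass `bag_prob.detach().flatten().tolist()` to `summarize` just before the loss call in `positive_bag_loss`. Rerunning with the environment variable `CUDA_LAUNCH_BLOCKING=1` should also make the Python traceback point at the kernel that actually failed.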