Sense-GVT / Fast-BEV

Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline

CUDA error: device-side assert triggered #49

Open Rango-Zhang-Hang opened 1 year ago

Rango-Zhang-Hang commented 1 year ago

Thank you for this great work! I followed the instructions and used the full nuScenes v1.0 dataset. However, every time I run the training code (I have tried multiple times), it fails with this error at around epoch 1 [14000/20000]. I am training with the provided '.pkl' files, so I wonder whether anyone else has run into this problem. I read online that the usual cause is an inconsistency between the labels and the output, but this error appears partway through training, not at the very beginning, so it seems very weird to me.

I attached the report:

,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [10,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [11,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [12,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [13,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [14,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [15,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [16,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [17,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 275, in main
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 191, in train_model
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 159, in train_detector
    runner.run(data_loaders, cfg.workflow)
  [...]
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 294, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 312, in forward_train
    loss_det = self.bbox_head.loss(*x, gt_bboxes_3d, gt_labels_3d, img_metas)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 234, in loss
    positive_losses.append(self.positive_bag_loss(matched_cls_prob, matched_box_prob))
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 272, in positive_bag_loss
    bag_prob, torch.ones_like(bag_prob), reduction='none')
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/functional.py", line 2762, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Aborted (core dumped)

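
The assertion in the report comes from `binary_cross_entropy`, which requires every input value to lie in [0, 1]; a NaN fails that comparison too. One way to narrow this down might be to validate `bag_prob` just before the loss call, so the failure surfaces as a readable Python error instead of a device-side assert. The sketch below is only an illustration: `check_probabilities` and `checked_positive_bag_loss` are hypothetical helper names, not part of Fast-BEV or mmdetection3d, and the real `positive_bag_loss` may do more than this.

```python
import torch
import torch.nn.functional as F


def check_probabilities(prob: torch.Tensor, name: str = "prob") -> None:
    """Raise ValueError if `prob` contains NaN, Inf, or values outside [0, 1].

    Hypothetical debugging helper, not part of any library.
    """
    num_nan = int(torch.isnan(prob).sum())
    num_inf = int(torch.isinf(prob).sum())
    finite = prob[torch.isfinite(prob)]
    num_out_of_range = int(((finite < 0) | (finite > 1)).sum())
    if num_nan or num_inf or num_out_of_range:
        raise ValueError(
            f"{name}: {num_nan} NaN, {num_inf} Inf, "
            f"{num_out_of_range} finite values outside [0, 1]"
        )


def checked_positive_bag_loss(bag_prob: torch.Tensor) -> torch.Tensor:
    """Same BCE call as in the traceback, preceded by a validity check."""
    check_probabilities(bag_prob, "bag_prob")
    return F.binary_cross_entropy(
        bag_prob, torch.ones_like(bag_prob), reduction="none"
    )


if __name__ == "__main__":
    print(checked_positive_bag_loss(torch.tensor([0.25, 0.9])))
    try:
        checked_positive_bag_loss(torch.tensor([0.5, float("nan")]))
    except ValueError as err:
        print("caught:", err)
```

Since CUDA kernels run asynchronously, setting the environment variable `CUDA_LAUNCH_BLOCKING=1` for a debugging run can also help confirm that the reported Python line is the one that actually triggered the assert.
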
Mandylove1993 commented 1 year ago

I am hitting this error too! Have you resolved it?

Rango-Zhang-Hang commented 1 year ago

I am hitting this error too! Have you resolved it?

Sadly no, have you?

silvercherry commented 1 year ago

Have you solved this problem?

huichen98 commented 1 year ago

I get the same error.

ycdhqzhiai commented 3 months ago

Comment out fp16. I ran into the same problem as well. With fp16 = dict(loss_scale='dynamic') added, this error does not occur, but during training the log shows grad_norm: nan.
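
For context on why fp16 could be involved: float16 can only represent finite values up to 65504, so a larger float32 value becomes Inf when cast, and arithmetic on Inf values can then produce NaN. The snippet below is a minimal sketch of that mechanism only. It does not show that overflow is what happens inside Fast-BEV, and `fp16_overflow_count` is a hypothetical helper name.

```python
import torch


def fp16_overflow_count(x: torch.Tensor) -> int:
    """Count finite values in `x` that become Inf when cast to float16."""
    half = x.to(torch.float16)
    return int((torch.isinf(half) & torch.isfinite(x)).sum())


if __name__ == "__main__":
    values = torch.tensor([1.0, 65504.0, 70000.0, -1e6])
    print("float16 max:", torch.finfo(torch.float16).max)
    print("overflowed:", fp16_overflow_count(values))
    half = values.to(torch.float16)
    # Adding +Inf and -Inf yields NaN, which then propagates through later ops.
    print("inf + (-inf):", half[2] + half[3])
```

Logging a count like this on intermediate activations for a few iterations might help tell apart an fp16 range problem from bad labels or a diverging learning rate.
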

LaCandela commented 1 month ago

I am having the same problem...