Sense-GVT / Fast-BEV

Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline

training error #13

Open guoqi-code opened 1 year ago

guoqi-code commented 1 year ago

When training reaches epoch 8, an error occurs: /opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [1488,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
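
For context, this assertion comes from PyTorch's CUDA kernel for binary cross entropy, which expects every input probability to lie in [0, 1]. A NaN also fails that check, so the message alone does not tell whether the predictions drifted out of range or became NaN. A minimal sketch of the same check in plain Python, which could be adapted to inspect predictions before the loss is computed (`find_invalid_probs` is a hypothetical helper, not part of Fast-BEV or PyTorch):

```python
import math


def find_invalid_probs(values):
    """Return (index, value) pairs that are NaN/inf or outside [0, 1]."""
    bad = []
    for i, v in enumerate(values):
        # NaN fails every comparison, so the range test alone would also
        # reject it; isfinite makes the intent explicit.
        if not (math.isfinite(v) and 0.0 <= v <= 1.0):
            bad.append((i, v))
    return bad


# Illustrative values only: one NaN and one negative entry are invalid.
probs = [0.2, 1.0, float("nan"), -0.1, 0.0]
invalid = find_invalid_probs(probs)
invalid_indices = [i for i, _ in invalid]
```

Logging the first iteration at which such values appear may help narrow down whether the instability starts in the data or in the network.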

ymlab commented 1 year ago

Unfortunately, this is a training instability problem that has plagued us for a long time, and we currently have no good way to avoid it. In practice, two workable approaches are to resume from the latest checkpoint and continue training, or to adjust the learning rate and restart training.

ymlab commented 1 year ago

Any suggestions for solving this issue are welcome.

pengcheng001 commented 1 year ago

I have encountered similar problems when training other models. Most likely the gt contains illegal values less than 0, for example a bbox whose width or height is negative. It is very likely that the data augmentation step did not filter out such gt properly.
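
A minimal sketch of such a filter, assuming 2D boxes stored as `(x1, y1, x2, y2)` tuples; the function name, box format, and `min_size` threshold are illustrative assumptions and not taken from the Fast-BEV pipeline:

```python
def filter_valid_boxes(boxes, labels, min_size=1e-3):
    """Keep only boxes whose width and height both exceed min_size.

    boxes:  list of (x1, y1, x2, y2) tuples (assumed format)
    labels: list of class labels, one per box
    """
    kept_boxes, kept_labels = [], []
    for box, label in zip(boxes, labels):
        x1, y1, x2, y2 = box
        width, height = x2 - x1, y2 - y1
        # Drop degenerate or inverted boxes left over from augmentation.
        if width > min_size and height > min_size:
            kept_boxes.append(box)
            kept_labels.append(label)
    return kept_boxes, kept_labels


# Illustrative input: the second box has negative width, the third zero width.
boxes = [(0, 0, 10, 10), (5, 5, 3, 8), (2, 2, 2, 4)]
labels = [0, 1, 2]
valid_boxes, valid_labels = filter_valid_boxes(boxes, labels)
```

A check like this would run after the augmentation step and before targets are built, so that invalid gt never reaches the loss.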

justttry commented 1 year ago

> I have encountered similar problems when training other models. Most likely the gt contains illegal values less than 0, for example a bbox whose width or height is negative. It is very likely that the data augmentation step did not filter out such gt properly.

Did you fix this bug? Any suggestion?

Mandylove1993 commented 1 year ago

> When training reaches epoch 8, an error occurs: /opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [1488,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.

Are you using distributed training?

huichen98 commented 1 year ago

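I took the first 1000 frames of the dataset, and training ran without errors whether augmentation was turned on or off. But when I train on the full dataset provided by the authors, the error occurs.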

huichen98 commented 1 year ago

config --> fp16 = dict(loss_scale="dynamic")
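
Dynamic loss scaling, in general terms, lowers the scale and skips the step when gradients overflow, then raises the scale again after a run of clean steps. A toy sketch of that idea in plain Python; this is not the mmcv implementation, and the class name and default values are illustrative assumptions:

```python
import math


class ToyDynamicLossScaler:
    """Toy illustration of dynamic loss scaling (not mmcv's code)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True


scaler = ToyDynamicLossScaler(init_scale=8.0, growth_interval=2)
step_after_overflow = scaler.update([1.0, float("inf")])  # overflow detected
scale_after_overflow = scaler.scale
scaler.update([0.1])
scaler.update([0.2])  # second clean step triggers growth
scale_after_growth = scaler.scale
```

Skipping overflowed steps can keep a single bad batch from corrupting the weights, although it does not address whatever causes the overflow.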

thfylsty commented 1 year ago

Disabling the loading of the torchvision pretrained weights for ResNet18 greatly reduces the chance of training collapsing.
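
In an mmdetection-style config this might look like the fragment below. It is a sketch under assumptions: the actual Fast-BEV config keys may differ, and the commented-out `init_cfg` line shows a typical pretrained setting rather than a line copied from the repository.

```python
# Hypothetical backbone section of an mmdetection-style config.
backbone = dict(
    type="ResNet",
    depth=18,
    # A typical pretrained setting (assumed, shown for comparison):
    # init_cfg=dict(type="Pretrained", checkpoint="torchvision://resnet18"),
    init_cfg=None,  # train the backbone from scratch instead
)
```

Training from scratch usually converges more slowly, so the schedule may need to be lengthened to compensate.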

WxlSky commented 7 months ago

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1670,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 268, in main

I have tested all the solutions above, but the problem still exists. Are there any other solutions?

HaoZhang16613 commented 7 months ago

Has anyone solved this problem?