OpenGVLab / UniFormerV2

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
https://arxiv.org/abs/2211.09552
Apache License 2.0

loss nan #52

Closed leehkk closed 11 months ago

leehkk commented 1 year ago

```
Traceback (most recent call last):
  File "tools/run_net.py", line 44, in <module>
    main()
  File "tools/run_net.py", line 25, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "f:\uestc\code\uniformerv2-main\slowfast\utils\misc.py", line 296, in launch_job
    torch.multiprocessing.spawn(
  File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 202, in start_processes
    while not context.join():
  File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 74, in _wrap
    fn(i, *args)
  File "f:\uestc\code\uniformerv2-main\slowfast\utils\multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "F:\uestc\code\UniFormerV2-main\tools\train_net.py", line 497, in train
    train_epoch(
  File "F:\uestc\code\UniFormerV2-main\tools\train_net.py", line 108, in train_epoch
    misc.check_nan_losses(loss)
  File "f:\uestc\code\uniformerv2-main\slowfast\utils\misc.py", line 33, in check_nan_losses
    raise RuntimeError("ERROR: Got NaN losses {}".format(datetime.now()))
RuntimeError: ERROR: Got NaN losses 2023-10-26 02:28:52.148789
```

At the start of training the loss is around 6, but after several epochs it becomes NaN. I checked and the model's output `preds` contains NaN while the input is fine. I have tried several learning rates and weight_decay values, but none of them solved it. Could you advise what the cause might be, and what the possible fixes are?
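For anyone debugging the same symptom, a minimal sketch of how the first NaN-producing module can be located with PyTorch forward hooks (`register_nan_hooks` is a hypothetical helper, not part of this repo):

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    """Attach forward hooks that raise on the first module whose output is NaN/Inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"Non-finite output first appeared in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call register_nan_hooks(model) before the training loop; the raised
# error then names the layer where NaN/Inf first shows up instead of only
# reporting a NaN loss at the end of the forward pass.
```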

Andy1621 commented 1 year ago

NaN loss has many possible causes; the most common is a learning rate that is too large. Try lowering it and lengthening the warmup epochs. The simplest fix is to disable mixed precision, or to use bf16 mixed precision instead.
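As a rough illustration of the bf16 suggestion in plain PyTorch (the tiny model, optimizer, and data below are placeholders, and bf16 autocast requires a GPU that supports it, e.g. Ampere or newer): bf16 keeps the same exponent range as fp32, so it avoids the overflow that often turns fp16 losses into NaN, and it needs no `GradScaler`.

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to make the snippet self-contained.
model = nn.Linear(16, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
inputs = torch.randn(4, 16, device="cuda")
labels = torch.randint(0, 10, (4,), device="cuda")

optimizer.zero_grad()
# bf16 autocast: fp32 exponent range, so no GradScaler is required.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    preds = model(inputs)
    loss = criterion(preds, labels)
loss.backward()
optimizer.step()
```

In this SlowFast-based codebase, the knobs mentioned above typically correspond to the config keys `SOLVER.BASE_LR`, `SOLVER.WARMUP_EPOCHS`, and `TRAIN.MIXED_PRECISION` (key names may differ depending on your config version).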