Traceback (most recent call last):
File "tools/run_net.py", line 44, in <module>
main()
File "tools/run_net.py", line 25, in main
launch_job(cfg=cfg, init_method=args.init_method, func=train)
File "f:\uestc\code\uniformerv2-main\slowfast\utils\misc.py", line 296, in launch_job
torch.multiprocessing.spawn(
File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 202, in start_processes
while not context.join():
File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "E:\APP\Anaconda3\envs\timesformer\lib\site-packages\torch\multiprocessing\spawn.py", line 74, in _wrap
fn(i, *args)
File "f:\uestc\code\uniformerv2-main\slowfast\utils\multiprocessing.py", line 60, in run
ret = func(cfg)
File "F:\uestc\code\UniFormerV2-main\tools\train_net.py", line 497, in train
train_epoch(
File "F:\uestc\code\UniFormerV2-main\tools\train_net.py", line 108, in train_epoch
misc.check_nan_losses(loss)
File "f:\uestc\code\uniformerv2-main\slowfast\utils\misc.py", line 33, in check_nan_losses
raise RuntimeError("ERROR: Got NaN losses {}".format(datetime.now()))
RuntimeError: ERROR: Got NaN losses 2023-10-26 02:28:52.148789
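At the start of training the loss is around 6, but after several epochs it becomes NaN. I checked, and the model's output preds contains NaN, while the inputs look fine. I have tried several learning rates and weight_decay values, but none of them fixed it. What could be causing this, and what are the possible fixes?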
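To narrow down where the NaN in preds first appears, something like the following could be used. It is only a minimal sketch: `add_nonfinite_hooks` is a hypothetical helper (not part of UniFormerV2 or slowfast), and the small `nn.Sequential` is a toy stand-in for the real model, with a NaN weight injected on purpose so the hooks have something to find.

```python
import torch
import torch.nn as nn


def add_nonfinite_hooks(model: nn.Module):
    """Hypothetical helper: record the names of submodules whose forward
    output contains NaN or Inf, in the order the modules finish running."""
    bad_modules = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # A module may return a single tensor or a tuple/list of tensors.
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    bad_modules.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return bad_modules, handles


# Toy stand-in for the real network (assumption, for illustration only).
toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    toy[0].weight.fill_(float("nan"))  # deliberately corrupt the first layer

bad, handles = add_nonfinite_hooks(toy)
toy(torch.ones(1, 4))
for h in handles:
    h.remove()  # detach the hooks once the check is done

print("first module with non-finite output:", bad[0])
```

On the real model, the first name in `bad` would point at the layer where the non-finite values start. If every forward output is finite but the weights themselves have become NaN, the problem is more likely in the backward pass or the optimizer step; in that case `torch.autograd.set_detect_anomaly(True)` and gradient clipping with `torch.nn.utils.clip_grad_norm_` may be worth trying, though whether either helps here is not something the traceback alone can confirm.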