Open Amano-Hina opened 1 year ago
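My program runs on Ubuntu and needs two or more GPUs to train. I'm not sure whether your problem is caused by running on Windows.
OK, I think I understand now. I've fixed most of the errors, thanks.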
I changed the epoch loop in main_train_swinfusion.py as follows:
for epoch in range(10000):  # keep running
    for i, train_data in enumerate(train_loader):
But after training for 1000 iterations, the following error appears:
23-05-06 14:31:05.347 : <epoch: 1, iter: 10,200, lr:2.000e-05> G_loss: 2.721e+00 Text_loss: 7.241e-01 Int_loss: 1.415e-01 SSIM_loss: 1.855e+00
23-05-06 14:34:08.261 : <epoch: 3, iter: 10,400, lr:2.000e-05> G_loss: 3.314e+00 Text_loss: 6.116e-01 Int_loss: 2.091e-01 SSIM_loss: 2.493e+00
23-05-06 14:37:10.746 : <epoch: 5, iter: 10,600, lr:2.000e-05> G_loss: 2.738e+00 Text_loss: 5.933e-01 Int_loss: 2.515e-01 SSIM_loss: 1.893e+00
23-05-06 14:40:13.426 : <epoch: 7, iter: 10,800, lr:2.000e-05> G_loss: 3.229e+00 Text_loss: 7.836e-01 Int_loss: 1.896e-01 SSIM_loss: 2.256e+00
23-05-06 14:43:16.382 : <epoch: 9, iter: 11,000, lr:2.000e-05> G_loss: 2.725e+00 Text_loss: 5.621e-01 Int_loss: 1.276e-01 SSIM_loss: 2.035e+00
23-05-06 14:43:16.382 : Saving the model. Save path is:Model/Infrared_Visible_Fusion/Infrared_Visible_Fusion/models/11000_E.pth
Traceback (most recent call last):
File "main_train_swinfusion.py", line 260, in
This is a problem that occurs while loading the dataset. Make sure that A_dir and B_dir contain the same number of files and that the file names match.
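The check suggested here can be scripted. The following is a minimal sketch, assuming A_dir and B_dir are flat folders whose paired images share the same file name; `find_mismatches` is a hypothetical helper and is not part of the SwinFusion code.

```python
import os


def find_mismatches(a_dir, b_dir):
    """Return (only_in_a, only_in_b): sorted file names found in just one folder."""
    # Collect regular files only; subdirectories are ignored.
    a_names = {n for n in os.listdir(a_dir) if os.path.isfile(os.path.join(a_dir, n))}
    b_names = {n for n in os.listdir(b_dir) if os.path.isfile(os.path.join(b_dir, n))}
    return sorted(a_names - b_names), sorted(b_names - a_names)


# Example call (the paths are placeholders, replace them with your own folders):
# only_in_a, only_in_b = find_mismatches("path/to/A_dir", "path/to/B_dir")
# print("only in A_dir:", only_in_a)
# print("only in B_dir:", only_in_b)
```

If both returned lists are empty, the two folders hold the same set of file names and therefore the same number of files.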
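OK, I think I understand now. I've fixed most of the errors, thanks.
Hi, I ran into the same problem during training. How exactly did you solve it?
Has this been solved? I've run into the same problem too.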
Has this been solved? I've run into the same problem on Windows as well.
Hi, what machine are you using? How does it train so fast?
(swinfus) PS D:\Python\SwinFusion-master> python -m torch.distributed.launch --nproc_per_node=3 --master_port=1234 main_train_swinfusion.py --opt options/swinir/train_swinfusion_vif.json --dist True
NOTE: Redirects are currently not supported in Windows or MacOs.
C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects `--local-rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [nj.baidupcs.com]:1234 (system error: 10049 - The requested address is not valid in its context.).
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=0
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=2
usage: main_train_swinfusion.py [-h] [--opt OPT] [--launcher LAUNCHER] [--local_rank LOCAL_RANK] [--dist DIST]
main_train_swinfusion.py: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 23612) of binary: C:\Anaconda3\envs\swinfus\python.exe
Traceback (most recent call last):
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Anaconda3\envs\swinfus\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 196, in
main()
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 192, in main
launch(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launch.py", line 177, in launch
run(args)
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Anaconda3\envs\swinfus\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_train_swinfusion.py FAILED
Failures:
[1]:
  time      : 2023-04-10_18:38:44
  host      : LAPTOP-0K7VHP1C
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 21040)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-04-10_18:38:44
  host      : LAPTOP-0K7VHP1C
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 25444)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-04-10_18:38:44
  host      : LAPTOP-0K7VHP1C
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 23612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The error log is shown above. What is causing this, and how can it be fixed? Thanks.
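One possible reading of the log: every worker exits with `unrecognized arguments: --local-rank=...`, while the script's usage line only lists `--local_rank`, so the launcher appears to pass a hyphenated flag that the script's argument parser does not define. The FutureWarning in the same log suggests reading `os.environ['LOCAL_RANK']` instead. Below is a minimal sketch of a parser that accepts both spellings and falls back to the environment variable. `build_parser` is a hypothetical helper, and the option types and defaults are assumptions, not the actual code in main_train_swinfusion.py.

```python
import argparse
import os


def build_parser():
    """Build a parser that tolerates both --local_rank and --local-rank."""
    parser = argparse.ArgumentParser()
    # Option names are taken from the usage line in the log;
    # types and defaults here are assumptions for illustration.
    parser.add_argument('--opt', type=str, default=None)
    parser.add_argument('--launcher', type=str, default='pytorch')
    parser.add_argument('--dist', default=False)
    # Register both spellings for one destination. If neither is given,
    # fall back to the LOCAL_RANK environment variable, then to 0.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    return parser


# Example: the hyphenated flag from the log now parses without an error.
args = build_parser().parse_args(['--local-rank=0', '--dist', 'True'])
print(args.local_rank)
```

With a parser like this, the script would accept either flag spelling and could also be started with torchrun, which the warning says sets `--use-env` by default.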