I am training the model on 8 GPUs. At the last epoch, the main GPU hangs, and I have to press Ctrl+C to quit the program. The traceback is as follows:
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
process.wait()
File "/usr/local/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/usr/local/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/usr/local/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
Launch command:
python -m torch.distributed.launch --nproc_per_node=8 tools/train_net.py --config-file configs/pretrain/seg_rec_poly_fuse_feature.yaml
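The traceback above only shows the launcher blocked in process.wait(), that is, waiting for a worker process that has not exited. It does not show where the worker itself is stuck. A minimal sketch for getting that information, assuming a Linux machine and that the training script can be edited: register Python's standard faulthandler module on SIGUSR1 so a hung worker prints its stack when signaled. The helper name enable_stack_dump is hypothetical, and calling it near the top of tools/train_net.py is an assumption about where it would fit.

```python
import faulthandler
import os
import signal
import sys


def enable_stack_dump(sig=signal.SIGUSR1, stream=None):
    """Hypothetical helper: dump the stacks of all threads when `sig` arrives.

    `stream` must be a real file object with a file descriptor;
    it defaults to sys.stderr. Returns this process's PID.
    """
    if stream is None:
        stream = sys.stderr
    # faulthandler writes straight to the file descriptor from a C-level
    # signal handler, so it can still report when Python code is blocked.
    faulthandler.register(sig, file=stream, all_threads=True)
    return os.getpid()


if __name__ == "__main__":
    # Assumed usage: call once at startup in every worker process.
    pid = enable_stack_dump()
    print(f"worker pid {pid}: send SIGUSR1 to dump its stack", file=sys.stderr)
```

When the job hangs, running `kill -USR1 <pid>` against each remaining worker PID should print where that worker is waiting, which would show whether it is stuck in a collective call, in evaluation, or somewhere else.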
Thanks in advance for any suggestions.