The main process does not exit after training finishes.

python -m torch.distributed.launch --nproc_per_node=8 tools/train_net.py --config-file configs/pretrain/seg_rec_poly_fuse_feature.yaml

I train the model with 8 GPUs. After the last epoch, the main process hangs and I have to press Ctrl+C to quit the program. The traceback printed at that point is:

  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
    process.wait()
  File "/usr/local/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/local/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
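
As far as I can tell, this traceback only shows the launcher blocked in process.wait(), which suggests that at least one worker process had not exited, rather than pointing at where the worker itself is stuck. Below is a minimal sketch of that waiting pattern with a timeout added, so the still-running children can be identified. It is not the actual launch.py code: find_unfinished and spawn are hypothetical helper names, and the sleeping children only stand in for training workers.

```python
import subprocess
import sys


def spawn(seconds):
    """Start a stand-in 'worker' that just sleeps for `seconds` (hypothetical helper)."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )


def find_unfinished(procs, timeout=1.0):
    """Return the indices of child processes still running after waiting
    up to `timeout` seconds on each one (hypothetical helper)."""
    unfinished = []
    for rank, proc in enumerate(procs):
        try:
            # Same call the launcher blocks in, but with a timeout
            # so a hung child does not block the parent forever.
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            unfinished.append(rank)
    return unfinished


if __name__ == "__main__":
    # Child 0 exits immediately; child 1 simulates a worker that never finishes.
    procs = [spawn(0), spawn(30)]
    stuck = find_unfinished(procs, timeout=5.0)
    print("children still running:", stuck)
    # Clean up the simulated hung worker.
    for rank in stuck:
        procs[rank].kill()
        procs[rank].wait()
```

With the real job, the same idea would be to check which of the 8 worker processes is still alive once training has finished, since that is the one the launcher is waiting on.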

Thanks for any suggestions.

MhLiao / MaskTextSpotterV3, issue #49