Open nemtiax opened 5 years ago
I've been using "screen" rather than tmux and so far it's working fine for me. I'm running on an Azure VM, so I:
1. ssh into the VM
2. start screen ("screen -S openseq" or similar)
3. run the docker container
4. start training ("python run.py...")
Detaching/reattaching the screen, or logging out of the VM entirely, does not affect training. Hope this helps!
I'm trying to train the ds2_small_1gpu model on an AWS p3.2xlarge instance. After setting everything up, I start a tmux session and run
python run.py --config_file=example_configs/speech2text/ds2_small_1gpu.py --mode=train_eval
Everything seems to work fine unless my connection to the machine drops. At that point, no further output is produced, whether to the screen, the checkpoints, or the log files. Reattaching to the tmux session does not resume progress.
While stuck in this way, the python process seems to use a full CPU core (~100% CPU according to top), which seems to suggest some sort of deadlock?
I'm unsure whether the root cause is some feature of OpenSeq2Seq or some other component of my setup, but I've never encountered this behavior with other deep learning frameworks on AWS, so I figured I'd start here.
I would welcome any suggestions for additional diagnostic steps I should take to help pin down the problem.
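One diagnostic step that might help here, sketched under assumptions: Python's standard faulthandler module can dump the stack of every thread when the process receives a signal, which would show where the stuck process is spinning. The helper name install_stack_dump_handler and the log path are hypothetical, and calling it near the top of run.py is an assumption about where it would fit, not something OpenSeq2Seq provides. This works on Unix only.

```python
import faulthandler
import signal


def install_stack_dump_handler(log_path):
    """Dump the tracebacks of all threads to log_path whenever SIGUSR1 arrives.

    Hypothetical helper for illustration. The returned file object must stay
    open for as long as the handler is registered.
    """
    log_file = open(log_path, "w")
    # faulthandler writes straight to the file descriptor from the signal
    # handler, so a dump is produced even if the main thread never returns
    # to the Python interpreter loop in a normal way.
    faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Assumed usage: call this once at startup, before training begins.
    handle = install_stack_dump_handler("stack_dump.log")
```

Once the process appears stuck, sending it the signal from another shell (kill -USR1 followed by the process ID shown in top) should append the current tracebacks to the log file. Writing to a file rather than the terminal matters here, since the terminal is the part that went away when the connection dropped.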