NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Training seems to freeze when disconnecting from terminal #473

Open nemtiax opened 5 years ago

nemtiax commented 5 years ago

I'm trying to train the ds2_small_1gpu model on an AWS p3.2xlarge instance. After setting everything up, I start a tmux session and run python run.py --config_file=example_configs/speech2text/ds2_small_1gpu.py --mode=train_eval

Everything works fine until my connection to the machine drops. At that point, no further output is produced, either to the screen or to the checkpoint and logging files. Training does not resume even after I reattach to the tmux session.

While stuck in this state, the Python process uses a full CPU core (~100% CPU according to top), which suggests some sort of deadlock or busy loop?

I'm not sure whether the root cause lies in OpenSeq2Seq or in some other component of my setup, but I've never encountered this behavior with other deep learning frameworks on AWS, so I figured I'd start here.

I would welcome any suggestions for additional diagnostic steps I should take to help pin down the problem.
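
In case it helps, here is what I'm planning to try next to see where the process is stuck: registering Python's standard-library faulthandler so I can dump every thread's stack with a signal while the run is hung. This is just a sketch; the log file name and signal choice are arbitrary, and I haven't confirmed it plays nicely with run.py yet. (A tool like py-spy would give similar information without touching the code.)

```python
# Sketch: add near the top of run.py, before training starts.
# While the process is hung, `kill -USR1 <pid>` from another shell
# appends every thread's Python stack trace to hang_traceback.log.
import faulthandler
import signal

_trace_log = open("hang_traceback.log", "a")

# Dump all thread stacks whenever the process receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, file=_trace_log, all_threads=True)

# Also dump stacks every 10 minutes, in case the hang happens
# while nobody is watching the terminal.
faulthandler.dump_traceback_later(600, repeat=True, file=_trace_log)
```

If the traces show, for example, the main thread sitting inside session.run while a data-loading thread is stuck, that would at least narrow down where the deadlock is.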

scottbouma commented 4 years ago

I've been using "screen" rather than tmux, and so far it's working fine for me. I'm running on an Azure VM and start training inside a screen session.

Detaching from and reattaching to the screen session, or logging out of the VM entirely, does not affect training. Hope this helps!
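
If you'd rather not depend on screen or tmux at all, another option is to launch the trainer fully detached from the terminal, so it never receives the SIGHUP when a connection drops. Something like the sketch below; the config path and log file name are just placeholders taken from the original post, and I haven't tested this against run.py specifically:

```python
# Sketch: start run.py in its own session so a dropped SSH connection
# (and the SIGHUP that follows) never reaches the training process.
import subprocess

log = open("train.log", "ab")
proc = subprocess.Popen(
    [
        "python", "run.py",
        "--config_file=example_configs/speech2text/ds2_small_1gpu.py",
        "--mode=train_eval",
    ],
    stdout=log,
    stderr=subprocess.STDOUT,
    start_new_session=True,  # setsid(): detach from the controlling terminal
)
print("training pid:", proc.pid)
```

Output then goes to train.log instead of the terminal, and you can follow it from any later login.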