Training seems to freeze when disconnecting from terminal

I'm trying to train the ds2_small_1gpu model on an AWS p3.2xlarge instance. After setting everything up, I start a tmux session and run:

python run.py --config_file=example_configs/speech2text/ds2_small_1gpu.py --mode=train_eval

Everything works fine until my connection to the machine drops. From that point on, no further output is produced: nothing on the screen, no new checkpoints, and nothing in the log files. Reattaching to the tmux session does not make training resume.

While stuck in this state, the python process uses a full CPU core (~100% in the %CPU column of top), which suggests it is spinning in some sort of busy-wait or deadlock?

I'm unsure whether the root cause is some feature of OpenSeq2Seq or some other component of my setup, but I've never encountered this behavior with other deep learning frameworks on AWS, so I figured I'd start here.

I would welcome any suggestions for additional diagnostic steps I should take to help pin down the problem.
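
One diagnostic I could try is getting the stuck process to print where each thread is currently executing. The following is only a minimal sketch, assuming a Unix host and Python 3, and assuming the snippet can be added near the top of run.py; the helper name enable_stack_dump is hypothetical and not part of OpenSeq2Seq. It uses the standard-library faulthandler module to dump all thread tracebacks when the process receives SIGUSR1.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(signum=signal.SIGUSR1, stream=None):
    """Dump the traceback of every thread to `stream` when `signum` arrives.

    Hypothetical helper for diagnosis only. `stream` must be a real file
    object (it needs a file descriptor) and must stay open while the
    handler is registered. Defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    # faulthandler writes directly to the file descriptor from the signal
    # handler, so it can report even if the Python main loop is stuck.
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum


if __name__ == "__main__":
    enable_stack_dump()
```

With this in place, running kill -USR1 <pid> from a second shell while training is frozen should write the tracebacks to the process's stderr, which may show which call the process is stuck in. If stderr is itself tied to the dropped terminal, passing an open log file as stream would be the safer choice.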

NVIDIA / OpenSeq2Seq

Training seems to freeze when disconnecting from terminal #473