cmusphinx / sphinxtrain

Acoustic model trainer for CMU Sphinx

Fix stalled Queue::POSIX training #40

Closed dhdaines closed 2 years ago

dhdaines commented 2 years ago

Fixes #33. The problem wasn't actually TiedWaitForConvergence(), which quite innocently sits around reading log files until the end of time. That is still arguably not a great idea, but the real culprit was norm_and_launch_bw.pl, which was "waiting" for baum_welch.pl jobs that had stopped living and become mixed-up zombies: they had exited, but nothing had reaped them.

This is a consequence of trying to fit a grid scheduler workflow onto plain processes, which is quite imperfect: a process has to be reaped with wait() or waitpid() by its parent, not by whatever other process happens to depend on its completion. The parent (slave_convg.pl) wasn't doing that, so I made it do so explicitly.
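To illustrate the reaping issue, here is a minimal sketch in Python rather than the Perl of the actual scripts. It assumes a POSIX system, and the helper names (`launch_jobs`, `pid_exists`, `reap_all`) are hypothetical and not part of SphinxTrain. It shows that a finished child keeps its process-table entry until the parent waits on it, so anything that merely checks whether the PID still exists will keep "waiting".

```python
import os
import subprocess
import sys


def launch_jobs(commands):
    """Start each command as a child process and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in commands]


def pid_exists(pid):
    """Return True if the PID still has a process-table entry.

    On POSIX this is true for running processes AND for zombies,
    i.e. children that have exited but have not been waited on.
    """
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    return True


def reap_all(jobs):
    """Wait on every child from the parent; map each PID to its exit code."""
    return {job.pid: job.wait() for job in jobs}


if __name__ == "__main__":
    jobs = launch_jobs([[sys.executable, "-c", "pass"]])
    print("before wait:", [pid_exists(j.pid) for j in jobs])
    codes = reap_all(jobs)
    print("after wait:", [pid_exists(pid) for pid in codes])
```

Only the process that launched the jobs can call `wait()` on them, which is why in this sketch the reaping lives next to the launch rather than in whatever depends on the results.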

You might think that just setting a global SIGCHLD handler before launching the darn things would solve the problem, but (sigh) for whatever reason, that doesn't work. And setting a handler after launching them is obviously wrong.

This shouldn't really slow down training, because we have to wait for all the jobs somewhere anyway. It should also be okay for batch scheduling systems, though it might be a bit slow.