Occasionally bug in training HMM

BrianMozy commented 1 month ago

I've hit a bug where, when training an HMM, the initialisation occasionally fails with an error like this:

2024-10-07 15:43:58 INFO osl-dynamics [hmm.py:470:random_state_time_course_initialization]: Initialization 0
2024-10-07 15:43:58 INFO osl-dynamics [hmm.py:977:set_random_state_time_course_initialization]: Setting random means and covariances
/gpfs3/well/woolrich/users/vxd741/osl-dynamics/osl_dynamics/models/hmm.py:644: RuntimeWarning: invalid value encountered in divide
  phi_interim = np.sum(xi, axis=0).reshape(
2024-10-07 15:48:46 ERROR osl-dynamics [hmm.py:303:fit]: Training failed!
2024-10-07 15:48:46 INFO osl-dynamics [hmm.py:470:random_state_time_course_initialization]: Initialization 1
2024-10-07 15:48:46 INFO osl-dynamics [hmm.py:977:set_random_state_time_course_initialization]: Setting random means and covariances
2024-10-07 15:53:32 ERROR osl-dynamics [hmm.py:303:fit]: Training failed!
2024-10-07 15:53:32 INFO osl-dynamics [hmm.py:470:random_state_time_course_initialization]: Initialization 2
2024-10-07 15:53:32 INFO osl-dynamics [hmm.py:977:set_random_state_time_course_initialization]: Setting random means and covariances
2024-10-07 15:58:31 ERROR osl-dynamics [hmm.py:303:fit]: Training failed!
2024-10-07 15:58:31 ERROR osl-dynamics [hmm.py:488:random_state_time_course_initialization]: Initialization failed

However, with the same training config, everything works fine most of the time. My HMM training script is attached.

train_hmm.txt

scho97 commented 1 month ago

I've encountered a similar problem, but during the main training rather than the initialisation.

With an HMM, a poor initialisation can sometimes leave one or more states with zero or near-zero posterior probability. One way to confirm this is to print the state fractional occupancies after training: if any state has zero or very low fractional occupancy, this is what has happened. It may also help to reduce sequence_length, since shorter sequences can alleviate numerical underflow.
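
A minimal sketch of that check, using plain NumPy rather than any osl-dynamics call. It assumes you already have the inferred state probabilities as an array of shape (n_samples, n_states); the name `alpha`, the toy values and the 1% threshold are illustrative only.

```python
import numpy as np


def fractional_occupancy(alpha):
    """Fraction of time points assigned to each state.

    alpha: (n_samples, n_states) array of state probabilities.
    """
    # Hard-assign each time point to its most probable state
    stc = np.argmax(alpha, axis=1)
    # Count time points per state, including states that never win
    counts = np.bincount(stc, minlength=alpha.shape[1])
    return counts / alpha.shape[0]


# Toy probabilities (made up for illustration): the third state is never active
alpha = np.array(
    [
        [0.9, 0.1, 0.0],
        [0.2, 0.8, 0.0],
        [0.7, 0.3, 0.0],
        [0.6, 0.4, 0.0],
    ]
)

fo = fractional_occupancy(alpha)
print("Fractional occupancies:", fo)

# Flag states that are (almost) never visited; the threshold is arbitrary
collapsed = np.flatnonzero(fo < 0.01)
print("States with <1% occupancy:", collapsed)
```

If `collapsed` is non-empty for a failed run, that points towards a collapsed state rather than a problem in the data loading.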

If you can try different random seeds and identify one that reproduces this error, we can debug it more easily.
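
A rough sketch of such a seed sweep. The helper `find_failing_seeds` and the stand-in `dummy_train` are hypothetical; in practice `train_fn` would seed the random number generators (for example with `np.random.seed` and `tf.random.set_seed`, assuming the randomness comes from NumPy and TensorFlow), build the model, train it and return the final loss. How a failed run actually surfaces (an exception, a NaN loss, or only a log message) is something to check for your setup.

```python
import numpy as np


def find_failing_seeds(train_fn, seeds):
    """Run train_fn(seed) for each seed and collect the seeds that fail.

    A run counts as failed if it raises or returns a non-finite loss.
    """
    failing = []
    for seed in seeds:
        try:
            loss = train_fn(seed)
        except Exception as exc:
            print(f"seed {seed}: raised {exc!r}")
            failing.append(seed)
            continue
        if not np.isfinite(loss):
            print(f"seed {seed}: non-finite loss {loss}")
            failing.append(seed)
    return failing


def dummy_train(seed):
    """Stand-in for the real training call: pretends seed 3 diverges."""
    rng = np.random.default_rng(seed)
    return float("nan") if seed == 3 else float(rng.random())


failing = find_failing_seeds(dummy_train, range(5))
print("Failing seeds:", failing)
```

Once a failing seed is known, rerunning with only that seed gives a deterministic case to share.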

cgohil8 commented 1 month ago

It might be that the initial covariances are being chosen such that you're more sensitive to what Sungjun described.

Can you run a couple of tests?

Skip the initialisation (i.e. comment it out). Does the same error arise during the main training?
This is the method that sets the means and covariances: https://github.com/OHBA-analysis/osl-dynamics/blob/main/osl_dynamics/models/hmm.py#L969. What's the fractional occupancy of the sampled state time course when the error arises?
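
To illustrate the second test, here is a sketch of measuring the fractional occupancy of a randomly sampled state time course. The sampling below (each time point drawn uniformly and independently) is an assumption made for illustration and may not match what the linked method does; `sample_random_stc`, `n_channels` and all the sizes are made up.

```python
import numpy as np


def sample_random_stc(n_samples, n_states, rng):
    """Uniformly random one-hot state time course (illustrative only)."""
    states = rng.integers(0, n_states, size=n_samples)
    return np.eye(n_states, dtype=int)[states]


rng = np.random.default_rng(0)
stc = sample_random_stc(n_samples=200, n_states=8, rng=rng)

# Fractional occupancy is the mean of each one-hot column
fo = stc.mean(axis=0)
counts = stc.sum(axis=0)
print("Fractional occupancy:", np.round(fo, 3))
print("Samples per state:", counts)

# A covariance estimated from no more samples than there are channels is
# rank-deficient, so flag states with too few samples (n_channels is assumed)
n_channels = 40
too_few = np.flatnonzero(counts <= n_channels)
print("States with too few samples for a full-rank covariance:", too_few)
```

Printing the same quantities for the state time course sampled in a failed initialisation would show whether any state received very few time points.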

OHBA-analysis / osl-dynamics

Occasionally bug in training HMM #291