train: HmmTopoloy::Check(), entry with no corresponding phones

Sixtease commented 3 years ago

Hello. While training, I get the following error (with context) in train/mono/log:

/home/user/Documents/MFA/thirdparty/bin/gmm-init-mono --shared-phones=/home/user/Documents/MFA/train/dictionary/phones/sets.int --train-feats=scp:/home/user/Documents/MFA/train/corpus_data/subset_2000/features_mfcc_cmvn_deltas.0.scp /home/user/Documents/MFA/train/dictionary/topo 39 /home/user/Documents/MFA/train/mono/0.mdl /home/user/Documents/MFA/train/mono/tree 
ERROR (gmm-init-mono[5.5]:Check():hmm/hmm-topology.cc:245) HmmTopoloy::Check(), entry with no corresponding phones.
kaldi::KaldiFatalError

Where could I be making a mistake?

mmcauliffe commented 3 years ago

Hmm, I've never seen this error pop up before. Could you give some more context?

What sort of data are you running this on?
What language/phoneset, etc?
Are you on the latest version of 2.0 (pip install montreal-forced-aligner -U)?
Have you tried running the validator on the dataset to see if it flags any issues (mfa validate, https://montreal-forced-aligner.readthedocs.io/en/latest/data_validation.html)?

Thanks!

Sixtease commented 3 years ago

Dear @mmcauliffe,

I am running the aligner's train routine on about 1600 hours of Czech speech data. All sentences have a non-empty transcript and the pronunciation dictionary covers 100% of the words. I use these phones:

a aa aw b c ch d dj dz dzh e ee ew f g h i ii j k l m mg n ng nj o oo ow p r rsh rzh s sh t tj u uu v x z zh
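
A quick way to double-check the phone set against the pronunciation dictionary is sketched below. It assumes a plain-text dictionary with one entry per line, the word first and its phones after it, separated by whitespace and with no probability columns. The function name check_dictionary is made up for illustration, and nothing in this thread establishes that a dictionary mismatch is what triggers the topology error; the sketch only rules out one possible source of trouble.

```python
from collections import Counter

# Phone set quoted in the comment above.
EXPECTED_PHONES = set(
    "a aa aw b c ch d dj dz dzh e ee ew f g h i ii j k l m mg n ng nj "
    "o oo ow p r rsh rzh s sh t tj u uu v x z zh".split()
)


def check_dictionary(lines, expected):
    """Compare the phones used in dictionary lines with an expected set.

    Returns (unexpected, unused, empty_words):
      unexpected  - phones found in the dictionary but not in `expected`
      unused      - phones in `expected` that no entry uses
      empty_words - words that have no phones after them
    """
    found = Counter()
    empty_words = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) == 1:
            empty_words.append(parts[0])  # word without a pronunciation
            continue
        found.update(parts[1:])
    unexpected = set(found) - expected
    unused = expected - set(found)
    return unexpected, unused, empty_words


if __name__ == "__main__":
    # Hypothetical entries; replace with the lines of the real dictionary file.
    demo = ["ahoj a h o j", "pes p e s"]
    unexpected, unused, empty_words = check_dictionary(demo, EXPECTED_PHONES)
    print("unexpected phones:", sorted(unexpected))
    print("unused phones:", sorted(unused))
    print("words without phones:", empty_words)
```

Phones reported as unused are not necessarily a problem, but they are worth a look before training on a large corpus.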

I use version 2.0.0a2 installed by pip install montreal-forced-aligner

I have run the validator and there was no error on STDOUT, just success messages. There was an error on STDERR, though: the validator essentially crashed with what I believe is the same error as the one reported above.

By the way, I have a hard time squeezing all the files (about 1M) into one directory. Is there some way to have several training directories?

Thank you for the support.

mmcauliffe commented 3 years ago

Hmm, interesting. Could you upgrade to the latest version (2.0.0a5) and see if you still get the issue? Also, maybe try running it with the --clean flag, just to make sure no temporary files are getting reused.

In general, I recommend per-speaker directories; that's the default mode. MFA will walk through the directory specified and collect all pairs of sound/transcription files. You can also use a flatter structure with -s and specify the number of characters at the beginning of each file name that correspond to the speaker code, but I find that a bit more brittle than just organizing speakers into their own directories.
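
For a corpus that currently sits in one flat directory, the per-speaker layout described above could be produced with something like the sketch below. It assumes that the first prefix_len characters of each file name identify the speaker. The function name group_by_speaker_prefix and its parameters are made up for illustration and are not part of MFA. It defaults to a dry run so the planned moves can be inspected first.

```python
import shutil
from pathlib import Path


def group_by_speaker_prefix(flat_dir, out_dir, prefix_len, dry_run=True):
    """Plan (and optionally perform) moving files into per-speaker folders.

    The speaker code is assumed to be the first `prefix_len` characters of
    each file name. Returns a dict mapping each source path to its target.
    """
    flat_dir = Path(flat_dir)
    out_dir = Path(out_dir)
    plan = {}
    # sorted() reads the full listing up front, so moving files is safe.
    for path in sorted(flat_dir.iterdir()):
        if not path.is_file():
            continue  # leave subdirectories alone
        speaker = path.name[:prefix_len]
        target = out_dir / speaker / path.name
        plan[path] = target
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
    return plan


if __name__ == "__main__":
    import tempfile

    # Self-contained demo on a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        flat = Path(tmp) / "flat"
        flat.mkdir()
        for name in ("spk1_0001.wav", "spk1_0001.lab", "spk2_0001.wav"):
            (flat / name).touch()
        moves = group_by_speaker_prefix(flat, Path(tmp) / "by_speaker", 4)
        for src, dst in moves.items():
            print(src.name, "->", dst.parent.name + "/" + dst.name)
```

Because a sound file and its transcript share the same prefix, both end up in the same speaker directory.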

Sixtease commented 3 years ago

I have tried with default config, latest version (-U). I am still getting weird errors with no useful search results.

Here is the content of train_and_align.log:

2021-03-02 08:01:34,904 - train_and_align - INFO - Setting up corpus information...
2021-03-02 12:10:49,902 - train_and_align - DEBUG - Parsed corpus directory with 3 jobs in 14954.997523069382 seconds
2021-03-02 12:10:49,928 - train_and_align - INFO - Number of speakers in corpus: 7, average number of utterances per speaker: 58618.142857142855
2021-03-02 12:10:49,928 - train_and_align - INFO - Number of speakers in corpus: 7, average number of utterances per speaker: 58618.142857142855
2021-03-02 12:10:50,607 - train_and_align - INFO - Parsing dictionary without pronunciation probabilties without silence probabilties
2021-03-02 12:10:56,257 - train_and_align - INFO - Creating dictionary information...
2021-03-02 12:11:12,026 - train_and_align - INFO - Setting up training data...
2021-03-02 20:25:53,665 - train_and_align - INFO - Initializing training for mono...
2021-03-02 20:25:56,957 - train_and_align - DEBUG - Setup for initialization took 3.2923102378845215 seconds
2021-03-02 20:26:08,058 - train_and_align - INFO - Initialization complete!
2021-03-02 20:26:08,058 - train_and_align - DEBUG - Initialization took 11.100818157196045 seconds
2021-03-02 20:30:15,627 - train_and_align - INFO - Training complete!
2021-03-02 20:30:15,627 - train_and_align - DEBUG - Training took 247.568745136261 seconds
2021-03-02 20:30:15,627 - train_and_align - INFO - Generating alignments using mono models using 5000 utterances...
2021-03-02 20:30:15,627 - train_and_align - DEBUG - Using feats as the feature name
2021-03-02 20:30:56,300 - train_and_align - DEBUG - Alignment took 40.6729633808136 seconds
2021-03-02 20:30:56,300 - train_and_align - INFO - Initializing training for tri...
2021-03-02 20:32:25,972 - train_and_align - DEBUG - Setup for initialization took 89.67221426963806 seconds
2021-03-02 20:33:03,924 - train_and_align - INFO - Initialization complete!
2021-03-02 20:33:03,924 - train_and_align - DEBUG - Initialization took 37.95134615898132 seconds
2021-03-02 20:36:40,517 - train_and_align - INFO - Training complete!
2021-03-02 20:36:40,517 - train_and_align - DEBUG - Training took 216.59327840805054 seconds
2021-03-02 20:36:40,517 - train_and_align - INFO - Generating alignments using tri models using 10000 utterances...
2021-03-02 20:36:40,517 - train_and_align - DEBUG - Using feats as the feature name
2021-03-02 20:38:37,247 - train_and_align - DEBUG - Alignment took 116.72939896583557 seconds
2021-03-02 20:38:37,247 - train_and_align - INFO - Initializing training for lda...
2021-03-02 20:41:33,381 - train_and_align - DEBUG - Setup for initialization took 176.13392972946167 seconds
2021-03-02 20:41:59,824 - train_and_align - DEBUG - There were 2 kaldi processing files that had errors:
2021-03-02 20:41:59,824 - train_and_align - DEBUG - 
2021-03-02 20:41:59,824 - train_and_align - DEBUG - /home/user/Documents/MFA/train/lda/log/lda_est.log
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     /home/user/Documents/MFA/thirdparty/bin/est-lda --write-full-matrix=/home/user/Documents/MFA/train/lda/full.mat --dim=40 /home/user/Documents/MFA/train/lda/lda.mat /home/user/Documents/MFA/train/lda/lda.0.acc /home/user/Documents/MFA/train/lda/lda.1.acc /home/user/Documents/MFA/train/lda/lda.2.acc
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     ERROR (est-lda[5.5]:Read():transform/lda-estimate.cc:193) LdaEstimate::Read, dimension or classes count mismatch, 1104, 91,  vs. 0, 0
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     kaldi::KaldiFatalError
2021-03-02 20:41:59,824 - train_and_align - DEBUG - 
2021-03-02 20:41:59,824 - train_and_align - DEBUG - /home/user/Documents/MFA/train/lda/log/acc_tree.2.log
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     /home/user/Documents/MFA/thirdparty/bin/acc-tree-stats --ci-phones=1:2:3:4:5:6:7:8:9:10:11:12:13:14:15 /home/user/Documents/MFA/train/tri_ali/final.mdl 'ark,s,cs:apply-cmvn --utt2spk=ark:/home/user/Documents/MFA/train/corpus_data/subset_10000/utt2spk.2 scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/cmvn.2.scp scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/feats.2.scp ark:- |' ark:/home/user/Documents/MFA/train/tri_ali/ali.2 /home/user/Documents/MFA/train/lda/2.treeacc
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     apply-cmvn --utt2spk=ark:/home/user/Documents/MFA/train/corpus_data/subset_10000/utt2spk.2 scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/cmvn.2.scp scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/feats.2.scp ark:-
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     LOG (apply-cmvn[5.5]:main():featbin/apply-cmvn.cc:162) Applied cepstral mean normalization to 0 utterances, errors on 0
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     LOG (acc-tree-stats[5.5]:main():bin/acc-tree-stats.cc:118) Accumulated stats for 0 files, 0 failed due to no alignment, 0 failed for other reasons.
2021-03-02 20:41:59,824 - train_and_align - DEBUG -     LOG (acc-tree-stats[5.5]:main():bin/acc-tree-stats.cc:121) Number of separate stats (context-dependent states) is 0
2021-03-02 20:41:59,825 - train_and_align - DEBUG -     WARNING (acc-tree-stats[5.5]:Close():util/kaldi-io.cc:515) Pipe apply-cmvn --utt2spk=ark:/home/user/Documents/MFA/train/corpus_data/subset_10000/utt2spk.2 scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/cmvn.2.scp scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/feats.2.scp ark:- | had nonzero return status 256
2021-03-02 20:41:59,825 - train_and_align - DEBUG -     ERROR (acc-tree-stats[5.5]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive 'apply-cmvn --utt2spk=ark:/home/user/Documents/MFA/train/corpus_data/subset_10000/utt2spk.2 scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/cmvn.2.scp scp:/home/user/Documents/MFA/train/corpus_data/subset_10000/feats.2.scp ark:- |'
2021-03-02 20:41:59,825 - train_and_align - DEBUG -     terminate called after throwing an instance of 'kaldi::KaldiFatalError'
2021-03-02 20:41:59,825 - train_and_align - DEBUG -     what():  kaldi::KaldiFatalError
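
Long logs like the one above are easier to triage when the fatal errors and the suspicious "0 utterances" / "0 files" counts are pulled out. The sketch below does that with two regular expressions written against the line shapes visible in this log. The function name summarize_log is made up for illustration, and the patterns are an assumption about the log format rather than anything defined by Kaldi or MFA.

```python
import re

# Matches e.g. "ERROR (est-lda[5.5]:Read():transform/lda-estimate.cc:193) msg"
ERROR_RE = re.compile(
    r"ERROR \((?P<tool>[\w-]+)\[[^\]]*\]:.*?:\d+\)\s*(?P<message>.*)"
)
# Matches counts such as "to 0 utterances" or "for 0 files".
ZERO_RE = re.compile(r"\b(?:to|for) 0 (?:utterances|files)\b")


def summarize_log(lines):
    """Return (errors, zero_counts) extracted from log lines.

    errors      - list of (tool, message) tuples for ERROR lines
    zero_counts - stripped lines reporting that nothing was processed
    """
    errors = []
    zero_counts = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            errors.append((match.group("tool"), match.group("message").strip()))
        elif ZERO_RE.search(line):
            zero_counts.append(line.strip())
    return errors, zero_counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py path/to/train_and_align.log
    with open(sys.argv[1], encoding="utf-8") as handle:
        errors, zero_counts = summarize_log(handle)
    for tool, message in errors:
        print(f"[{tool}] {message}")
    for line in zero_counts:
        print("nothing processed:", line)
```

Run on the log above, this would surface the est-lda and acc-tree-stats errors together with the two lines showing that one job processed no utterances at all, which is a reasonable place to start looking.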

Sixtease commented 3 years ago

I have reduced the data set and the error no longer occurs. It may have been caused by very noisy training samples, but unfortunately I don't know for sure.

MontrealCorpusTools / Montreal-Forced-Aligner

train: HmmTopoloy::Check(), entry with no corresponding phones #236