MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

MFA Validate Inconsistent Output #782

Open shreeshailgan opened 3 months ago

shreeshailgan commented 3 months ago

I am running mfa validate on the LibriTTS-train-clean-460 dataset using an IPA dictionary I have. The output contains:

WARNING  288196 total OOV tokens

However, in the generated oov_counts.txt file (see the excerpt below), the counts in the second column sum to 32,905. Shouldn't these two numbers be equal? If not, what does 288,196 represent?

--and   151
phoenix 104
--the   99
--a     88
--i     77
--but   67
ion     65
...

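For reference, the column sum can be checked with a short script. This is only a minimal sketch: it assumes each line of oov_counts.txt is a token followed by whitespace and an integer count, as in the excerpt above, and the helper name `sum_oov_counts` is hypothetical, not part of MFA.

```python
def sum_oov_counts(lines):
    """Sum the trailing integer count on each 'token<whitespace>count' line."""
    total = 0
    for line in lines:
        # Split on the last run of whitespace so the count is always the final field.
        parts = line.rsplit(None, 1)
        # Skip blank lines and anything without a numeric count (e.g. the "..." elision).
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        total += int(parts[1])
    return total


# The seven rows shown in the excerpt above (the rest of the file is elided).
sample = [
    "--and   151",
    "phoenix 104",
    "--the   99",
    "--a 88",
    "--i 77",
    "--but   67",
    "ion 65",
    "...",
]
print(sum_oov_counts(sample))  # 651 for these seven rows only

# For the real file (the path is an assumption; adjust to the validate output directory):
# with open("oov_counts.txt", encoding="utf-8") as f:
#     print(sum_oov_counts(f))
```

Run against the full file, this should reproduce the 32,905 figure if the format assumption holds; it says nothing about how the 288,196 total is computed.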
mmcauliffe commented 3 months ago

Are you passing configuration options that remove punctuation symbols? What's the full command you're running and what version are you on?

shreeshailgan commented 3 months ago

MFA version: montreal-forced-aligner 3.0.1 pyhd8ed1ab_0 conda-forge

Full command: mfa validate /path/to/data/ /path/to/lexicon --ignore_acoustics --num_jobs 48