All initial words are marked as OOV words

MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi

https://montrealcorpustools.github.io/Montreal-Forced-Aligner/

MIT License


All initial words are marked as OOV words #646

Open jpereznavarro opened 1 year ago

jpereznavarro commented 1 year ago

I'm trying to align two corpora, one in Spanish (using spanish_spain_mfa) and one in Basque (basque_cv). I'm working on each language separately, but in both cases every transcript-initial word is marked as an OOV item (spn), even when it is in the respective dictionary. I'm working with "normal" .wav (22.05 kHz) and .txt (UTF-8) files. Padding the ends of the .wav files does not solve the issue. Any ideas what I might be doing wrong? Thanks a lot!

jpereznavarro commented 1 year ago

Could it be because my transcript files (.txt) have the following format: "The cat was sitting on the mat", with just one utterance per transcript and no speaker information, so that the first word is being interpreted as speaker info?

mmcauliffe commented 1 year ago

What is the command you're running? Does it specify --ignore_case false in some way (or via a configuration file)?

jpereznavarro commented 1 year ago

I'm using mfa validate corpus_dir spanish_spain_mfa spanish_spain_mfa

Could this be because of the .txt files? This issue does not occur when the transcripts are in TextGrid format.

mmcauliffe commented 1 year ago

If this is still an issue, can you attach one of the text files that is having this issue so I can debug further?
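Before attaching a file, it may help to look at the raw start of each transcript. The thread does not establish a cause, so the following is only a minimal sketch for testing one hypothesis: that the .txt files begin with a UTF-8 byte order mark (BOM), which would attach an invisible character to the first word. The helper names (has_utf8_bom, first_token, inspect_transcripts) are hypothetical and are not part of MFA.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def first_token(data: bytes) -> str:
    """Return the first whitespace-delimited token of a UTF-8 transcript.

    Decoding with plain "utf-8" (not "utf-8-sig") keeps a leading BOM as
    U+FEFF, so it stays glued to the first word and shows up in repr().
    """
    parts = data.decode("utf-8").split()
    return parts[0] if parts else ""


def inspect_transcripts(corpus_dir: str) -> None:
    """Print, for every .txt file under corpus_dir, whether it starts with
    a BOM and what its first token looks like (hypothetical helper)."""
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        data = path.read_bytes()
        print(path, "BOM" if has_utf8_bom(data) else "no BOM", repr(first_token(data)))
```

Calling inspect_transcripts("corpus_dir") lists each transcript with its first token. A first token displayed as something like '\ufeffThe' rather than 'The' would indicate a BOM; if every file reports "no BOM", this hypothesis can be ruled out and the cause lies elsewhere.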