MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Enable ignore_case = False #487

Closed ibleaman closed 1 year ago

ibleaman commented 2 years ago

I'm running mfa validate and mfa train on a small corpus with a dictionary file. My dictionary entries are case-sensitive, e.g., a word like Main would be defined with a different pronunciation than main. I see that one can specify ignore_case as a parameter -- but how exactly is that accomplished?

For context, I created a file named config.yaml containing one line:

ignore_case: false

and then ran mfa validate --config_path config.yaml corp lex.txt, but based on the OOV list, all words are still being converted to lowercase.
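To make the expected behavior concrete, here is a minimal Python sketch of an OOV lookup with and without case folding. It is not MFA's implementation: `find_oovs` and its `ignore_case` argument are hypothetical names used only for illustration, and the word lists are made-up sample data.

```python
def find_oovs(transcript_words, lexicon_words, ignore_case):
    """Return transcript words that have no lexicon entry, in first-seen order.

    Hypothetical helper for illustration only; this is not MFA code.
    """
    if ignore_case:
        # Case-insensitive: fold both the lexicon and the transcript to lowercase.
        known = {w.lower() for w in lexicon_words}
        normalize = str.lower
    else:
        # Case-sensitive: compare words exactly as written.
        known = set(lexicon_words)
        normalize = lambda w: w

    oovs = []
    seen = set()
    for word in transcript_words:
        key = normalize(word)
        if key not in known and key not in seen:
            seen.add(key)
            oovs.append(key)
    return oovs


# Sample data (made up): "Main" and "main" are distinct lexicon entries.
lexicon = ["Main", "main"]
transcript = ["Main", "main", "Haus"]

print(find_oovs(transcript, lexicon, ignore_case=False))  # ['Haus']
print(find_oovs(transcript, lexicon, ignore_case=True))   # ['haus']
```

With case-sensitive matching, the OOV list would be expected to keep the capitalization found in the transcript, which is the opposite of the all-lowercase list described above.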

Thanks!

mmcauliffe commented 2 years ago

What version of MFA are you using and can you try it on the latest one? I'm not seeing this show up when using the xsampa test data:

https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/main/tests/data/lab/xsampa.lab
https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/main/tests/data/dictionaries/xsampa.txt

ibleaman commented 1 year ago

@mmcauliffe My apologies for the delay! I'm returning to this now and still have the same issue.

I am using version 2.0.6, installed in the standard conda way (on Google Colab).

I've run both of these commands:

mfa validate --config_path yid_config.yaml yid_corp yid_corp/yiddish_lexicon.txt
mfa train --config_path yid_config.yaml --include_original_text yid_corp yid_corp/yiddish_lexicon.txt alignments

yid_config.yaml contains this line:

ignore_case: false

(I wasn't sure from the documentation whether to capitalize false but assumed from this file that I shouldn't.)

The directory yid_corp consists of matched .TextGrid and .wav files, 2 tiers each, 1 per speaker, as well as the lexicon file. The lexicon file has words with both lowercase and uppercase characters, mapped onto their phones. The capitalization is important because I have many minimal pairs.

The resulting aligned .TextGrid files (including the original utterance text tier!) are entirely in lowercase. Both oov_counts_yiddish_lexicon.txt and oovs_found_yiddish_lexicon.txt are also all lowercase. Interestingly, words.txt (inside /Documents/MFA/yid_corp_train_acoustic_model/dictionary/1_yiddish_lexicon/) shows the correct capitalization.
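One way to gauge how much is lost when everything is lowercased is to list the lexicon entries that differ only in capitalization. The following Python sketch assumes a plain-text lexicon with one entry per line, the word first and its phones after it, separated by whitespace. `case_collisions` is a hypothetical helper and the sample entries (including the phone strings) are placeholders, not lines from the actual lexicon.

```python
from collections import defaultdict


def case_collisions(lexicon_lines):
    """Group lexicon words that become identical once lowercased.

    Assumes each line is "<word> <phone> <phone> ..." (an assumption about
    the lexicon layout). Returns only groups with more than one spelling.
    """
    groups = defaultdict(set)
    for line in lexicon_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        groups[word.lower()].add(word)
    return {key: sorted(words) for key, words in groups.items() if len(words) > 1}


# Placeholder entries for illustration; phones are invented.
sample = [
    "Main m a j n",
    "main m aj n",
    "hoyz h o j z",
]

print(case_collisions(sample))  # {'main': ['Main', 'main']}
```

To run it on a real file, the lines can be read with `open(path, encoding="utf-8")` and passed to `case_collisions`; every group it returns is a set of entries that would share one pronunciation set if case were ignored.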

Am I using the configuration file incorrectly? Please let me know what you advise. Thank you!

ibleaman commented 1 year ago

@mmcauliffe Do you have any updates on this issue? Thanks!