MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] Word break markers options not respected #751

Closed stefanocoretta closed 9 months ago

stefanocoretta commented 9 months ago

Debugging checklist

[x] Have you updated to latest MFA version? ~I could only get 2.2.4 to work on my Mac Book~ Same issue with 3.0.0a8
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue

When running validate, OOV words are reported because strings containing ' (like ç'vjen) are not split, despite my having set a custom list of word_break_markers.

For reproducing your issue, please fill out the following:

  1. Corpus structure
    • What language is the corpus in? Albanian
    • How many files/speakers? Testing with 1 speaker 1 file containing ç'vjen
    • Are you using lab files or TextGrid files for input? Lab files
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? No
    • If it's a custom dictionary, what is the phoneset? IPA
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
    • If it's a model you've trained, what data was it trained on?

Log file

No error, just not applying the configuration

Desktop (please complete the following information):

Additional context

This is the config.yaml file I am using

word_break_markers: [" ", "?", "!", "(", ")", ",", ",", ".", ":", ";", "¡", "¿", "?", "“", "„", "”", "&", "~", "%", "#", "—", "…", "‥", "、", "。", "【", "】", "$", "+", "=", "〝", "〟", "″", "‹", "›", "«", "»", "・", "⟨", "⟩", "「", "」", "『", "』", "”", "'"]

mmcauliffe commented 9 months ago

By default, apostrophes are treated as clitic markers, so you'd have to remove that via a line like:

clitic_markers:

If "ç'vjen" should be treated the same as "ç vjen", then that should work. I'm assuming, though, that it's not pronounced the same, since "ç" by itself would usually be the name of the letter, and letter names usually have a vowel component (though maybe it's always /tʃ/ according to https://en.wiktionary.org/wiki/%C3%A7#Pronunciation_2). In that case, you can add "ç'" to the pronunciation dictionary with its pronunciation:

ç'  tʃ

See https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#text-normalization for more information.
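To illustrate the idea of clitic splitting, here is a minimal Python sketch. It is not MFA's actual implementation: `split_clitics` is a hypothetical helper, and the rule it applies (split at a clitic marker only when both resulting pieces are dictionary entries) is an assumption made for illustration.

```python
def split_clitics(token, dictionary, clitic_markers=("'",)):
    """Split a token at a clitic marker if both pieces are in the dictionary.

    Hypothetical helper for illustration only; not part of MFA.
    """
    # A token that is already a dictionary entry is left alone.
    if token in dictionary:
        return [token]
    for i, ch in enumerate(token):
        if ch not in clitic_markers:
            continue
        # Proclitic reading: the marker stays with the left piece (ç' + vjen).
        left, right = token[: i + 1], token[i + 1 :]
        if left in dictionary and right in dictionary:
            return [left, right]
        # Enclitic reading: the marker stays with the right piece.
        left, right = token[:i], token[i:]
        if left in dictionary and right in dictionary:
            return [left, right]
    # No valid split found: the token stays whole (and would be reported as OOV).
    return [token]


lexicon = {"ç'", "vjen"}  # toy dictionary for this example
pieces = split_clitics("ç'vjen", lexicon)
```

With the toy dictionary above, `pieces` holds the two entries ç' and vjen. If "ç'" were missing from the dictionary, the token would come back unsplit.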

stefanocoretta commented 9 months ago

Thanks! It should be treated as ç' and vjen. I thought I had initially just used the defaults, but the tokeniser was not splitting ç' and vjen; it just gave ç'vjen and complained that ç'vjen was OOV even though ç' and vjen were in the dictionary. But I was using version 2.2.4, so maybe it was that!

I tried today with 3.0.0, and indeed it worked as intended (it split ç'vjen into ç' and vjen)! :)

Thanks for the clarification!