Closed stefanocoretta closed 9 months ago
By default, apostrophes are treated as clitic markers, so you'd have to remove that via a line like:
clitic_markers:
If "ç'vjen" should be treated the same as "ç vjen", then that should work. I'm assuming it's not pronounced the same, though, since "ç" by itself would usually be the name of the letter (maybe it's always /tʃ/ according to https://en.wiktionary.org/wiki/%C3%A7#Pronunciation_2, but letter names usually have a vowel component). You can add "ç'" to the pronunciation dictionary with its pronunciation:
ç' tʃ
See https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#text-normalization for more information.
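To make the clitic behaviour concrete, here is a minimal toy sketch of dictionary-backed clitic splitting. It is not MFA's actual implementation; the function name `split_clitics` and its parameters are hypothetical and exist only for illustration. The assumption is that a token is split right after a clitic marker only when both resulting pieces are dictionary entries.

```python
def split_clitics(token, dictionary, clitic_markers="'"):
    """Toy illustration (not MFA code): split a token right after a
    clitic marker when both resulting pieces are dictionary entries."""
    if token in dictionary:
        return [token]
    for i, ch in enumerate(token):
        if ch in clitic_markers:
            # Keep the marker attached to the left-hand piece, e.g. "ç'".
            head, tail = token[: i + 1], token[i + 1 :]
            if head in dictionary and tail in dictionary:
                return [head, tail]
    # No valid split found: the token stays whole and would be reported as OOV.
    return [token]


dictionary = {"ç'", "vjen"}
print(split_clitics("ç'vjen", dictionary))
# With no clitic markers configured, the token is left unsplit.
print(split_clitics("ç'vjen", dictionary, clitic_markers=""))
```

Under these assumptions, the first call yields the two pieces `ç'` and `vjen`, while the second leaves `ç'vjen` whole.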
Thanks! It should be treated as `ç'` and `vjen`. I thought I had initially just used the defaults, but the tokeniser was not splitting `ç'` and `vjen`. Instead it gave `ç'vjen` and complained that `ç'vjen` was OOV, even though `ç'` and `vjen` were in the dictionary. But I was using version 2.2.4, so maybe it was that!
I tried today with 3.0.0, and indeed it worked as intended (it split `ç'vjen` into `ç'` and `vjen`)! :)
Thanks for the clarification!
Debugging checklist
- [x] Have you updated to the latest MFA version? ~I could only get 2.2.4 to work on my Mac Book~ Same issue with 3.0.0a8
- [x] Have you tried rerunning the command with the `--clean` flag?

Describe the issue
When running `validate`, OOV words are found because strings containing `'`, like `ç'vjen`, are not split, despite having set a custom list of `word_break_markers`.
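As a rough illustration of the symptom (not of how MFA's `validate` works internally), the sketch below checks whitespace-separated tokens against a dictionary. The helper `find_oov` and the in-memory data are hypothetical. When the apostrophe does not trigger a split, the whole string is looked up and reported as OOV even though its parts are known.

```python
def find_oov(lines, dictionary):
    """Return whitespace-separated tokens that are not dictionary entries,
    in order of first appearance (toy check, not MFA's validator)."""
    oov = []
    for line in lines:
        for token in line.split():
            if token not in dictionary and token not in oov:
                oov.append(token)
    return oov


transcript = ["ç'vjen"]          # example string from this issue
dictionary = {"ç'", "vjen"}      # both parts are present as entries
print(find_oov(transcript, dictionary))
```

Here the unsplit token `ç'vjen` is the only item reported, which matches the behaviour described above.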
For Reproducing your issue
Please fill out the following:
ç'vjen
Log file: No error, just not applying the configuration
Desktop (please complete the following information):
Additional context
This is the `config.yaml` file I am using