MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Force aligning against a supplied string of phones (instead of using a dictionary) #577

Closed (rbracco closed 1 year ago)

rbracco commented 1 year ago

Is your feature request related to a problem? Please describe.

This feature may already exist, but I'm looking for a way to use a model I've pretrained to force-align against a specific string of phonemes, instead of having MFA look up the phones in the dictionary from the English text. So if the sentence is "The white dog", instead of MFA looking up each word in the dictionary and then force-aligning against "ðə waɪt dɒɡ", I'd rather be able to align against any phone string, including ones that aren't actual words, for example "ðə waɪk dɒɡ" or "ðə waɪt dɡ".

Describe the solution you'd like

A way to do something like mfa align AUDIO_FILE PHONE_LABEL ACOUSTIC_MODEL_PATH. It doesn't need to be exposed in the API at this top level; if I can go into the code and do this manually somehow, that would suffice.

Describe alternatives you've considered

I could make a very simple dictionary with a one-to-one mapping from the words to the phones I want, but this would be very tedious and would not scale.

mmcauliffe commented 1 year ago

You could probably approximate it by setting the text for the file to "ðə waɪt dɒɡ" and then specifying your own dictionary. (I feel like automating this wouldn't be too much additional effort if you're creating the phone string of the utterance text anyway, right?) The dictionary would look like:

ðə ð ə
waɪt w aɪ t (you might want aj here instead of aɪ if you're using english_mfa model)
dɒɡ d ɒ ɡ
etc

or just a dictionary like

ð ð
ə ə
w w
aɪ aɪ
t t
d d
ɒ ɒ
ɡ ɡ

with the text written as "ð ə w aɪ t d ɒ ɡ", but then you lose the word-level information, which you might want.

That's probably the easiest way. The integration with lexicons is a pretty deep assumption throughout the alignment code. That said, I have been playing around with other ways of generating the utterance graph that don't use a lexicon file (instead using an integrated g2p model as the lexicon), so it should be doable, but I'd have to think about the best way to invoke that functionality.
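
For anyone wanting to automate the two workarounds above, here is a minimal Python sketch. It is not part of MFA. The function names (word_level, phone_level) and the input shape are hypothetical: it assumes each utterance is already tokenized as a list of pseudo-words, each given as a list of phone symbols, since a multi-character phone like "aɪ" cannot be recovered reliably from an unsegmented string. It also assumes the usual dictionary layout of one entry per line, the word followed by whitespace and then its space-separated phones.

```python
from typing import List, Tuple

# Hypothetical input shape: an utterance is a list of pseudo-words,
# and each pseudo-word is a list of phone symbols.
Utterance = List[List[str]]


def word_level(utterance: Utterance) -> Tuple[str, List[str]]:
    """First workaround: keep word boundaries.

    Each pseudo-word is spelled by concatenating its phones, and the
    dictionary maps that spelling back to the phone sequence.
    """
    words = ["".join(phones) for phones in utterance]
    # A set of (word, pronunciation) pairs removes exact duplicates. Two
    # different phone sequences that concatenate to the same spelling would
    # both be kept, as alternative pronunciations of that spelling.
    entries = {(w, " ".join(phones)) for w, phones in zip(words, utterance)}
    transcript = " ".join(words)
    dictionary_lines = [f"{w}\t{pron}" for w, pron in sorted(entries)]
    return transcript, dictionary_lines


def phone_level(utterance: Utterance) -> Tuple[str, List[str]]:
    """Second workaround: every phone is its own 'word'.

    Word boundaries are lost; the dictionary maps each phone to itself.
    """
    phones = [p for word in utterance for p in word]
    transcript = " ".join(phones)
    dictionary_lines = [f"{p}\t{p}" for p in sorted(set(phones))]
    return transcript, dictionary_lines


# Example utterance containing a non-word ("waɪk").
example = [["ð", "ə"], ["w", "aɪ", "k"], ["d", "ɒ", "ɡ"]]
word_transcript, word_dictionary = word_level(example)
phone_transcript, phone_dictionary = phone_level(example)
```

The transcript string would then be saved as the text file that sits next to the audio in the corpus directory, and the dictionary lines as a plain-text dictionary file, before running the usual corpus, dictionary, acoustic model, output directory form of mfa align. The phones have to match the phone set of the acoustic model (for example aj rather than aɪ with english_mfa, as noted above).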

rbracco commented 1 year ago

Thank you, that clarifies things a lot. I'll try the escape hatches you suggested and reopen if there are any further questions.