MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

No longer able to align using phonemes directly as inputs #804

Closed · leandro-gracia-gil closed this issue 5 months ago

leandro-gracia-gil commented 6 months ago

I have been using mfa align to generate alignments of audio with input IPA phonemes directly instead of text. I did this with a handmade dictionary that simply maps each IPA phoneme to itself. The reason is that my use case requires me to run G2P separately in my own way, while ensuring that the produced phonemes are supported by the MFA acoustic model.
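
For reference, here is a minimal sketch of how such an identity dictionary can be generated, assuming the standard MFA dictionary format of one word per line followed by its phones. The phone inventory below is an illustrative subset, not the full japanese_mfa phone set:

# Build a pronunciation dictionary where each IPA phone "word" maps to itself,
# so text files containing space-separated IPA phones pass through unchanged.
phones = ["a", "i", "ɯ", "e", "o", "k", "s", "t", "n", "h", "m", "ɾ", "ɕ"]

with open("ipa_identity.dict", "w", encoding="utf-8") as f:
    for p in phones:
        f.write(f"{p}\t{p}\n")  # MFA format: word<TAB>space-separated phones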

However, after updating from version 2.x to 3.x (specifically 3.0.7), I'm seeing that mfa align now performs a text tokenization step that modifies my input IPA phonemes and affects the alignment results.

Here's an example with Japanese text (好きにする):

(I got these tokenizer results by checking tokenization/japanese.py in the installed MFA package code while debugging the issue)
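
To reproduce the check without stepping through MFA, here is a minimal sketch using sudachipy directly, which the 3.x Japanese tokenizer wraps. This assumes a Sudachi dictionary such as sudachidict_core is installed, and may not match the exact configuration in tokenization/japanese.py:

from sudachipy import dictionary, tokenizer

# Tokenize a string the way Sudachi would; an IPA phone string gets segmented
# and normalized as if it were ordinary Japanese text.
tok = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
for m in tok.tokenize("好きにする", mode):
    print(m.surface(), m.reading_form())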

Is there any way to bypass the tokenizer and align using my input phonemes directly?

  1. Corpus structure
    • What language is the corpus in? Japanese
    • How many files/speakers? For now, this is just a single speaker test to check things work.
    • Are you using lab files or TextGrid files for input? Input text files with IPA phonemes directly.
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? The current phonemes should come from japanese_mfa v3.0.0.
    • If it's a custom dictionary, what is the phoneset?
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one? japanese_mfa v3.0.0
    • If it's a model you've trained, what data was it trained on?

Log file: No log files were generated, since the problem does not manifest as a runtime error.

leandro-gracia-gil commented 6 months ago

Note: this example is for Japanese, but I expect to do the same (feeding phonemes as input) in a few other Latin-script languages. I haven't checked yet whether these are also affected by the same issue.

leandro-gracia-gil commented 6 months ago

Also, here is one thing I had to fix while debugging; I can open a separate bug for it if needed.

In file tokenization/japanese.py, line 19:

config_path = resource_dir.joinpath("japanese", "sudachi_config.json")

This fails later because config_path is a pathlib.Path object, which sudachipy does not accept. It can easily be fixed by forcing a conversion to string:

config_path = str(resource_dir.joinpath("japanese", "sudachi_config.json"))
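
For context, here is a minimal sketch of the failure mode, assuming MFA hands config_path to sudachipy roughly like this (the exact call site in MFA may differ):

from pathlib import Path
from sudachipy import dictionary

config_path = Path("japanese") / "sudachi_config.json"
# dictionary.Dictionary(config_path=config_path)  # fails: pathlib.Path not accepted
tok = dictionary.Dictionary(config_path=str(config_path)).create()  # works
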
mmcauliffe commented 6 months ago

You can download the old 2.0 Japanese model via mfa download acoustic japanese_mfa --version 2.0.1a --force (see https://mfa-models.readthedocs.io/en/latest/acoustic/Japanese/Japanese%20MFA%20acoustic%20model%20v2_0_1a.html). The 3.0 Japanese model uses sudachipy's tokenization for input text and assumes it is normal Japanese kana/kanji/romaji, which is why "i" gets mapped to アイ and IPA-specific symbols are ignored.
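
Putting the pieces together, the downgraded workflow would look roughly like this, reusing an identity dictionary as described above (paths are illustrative; the corpus/dictionary/model/output argument order follows the mfa align documentation):

mfa download acoustic japanese_mfa --version 2.0.1a --force
mfa align /path/to/corpus ipa_identity.dict japanese_mfa /path/to/output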

leandro-gracia-gil commented 6 months ago

I see, thanks. Tokenization issue aside, are there any other new features or quality improvements I would be missing by using the old 2.0.1a model instead of the 3.0.0 one?

Also, since the 3.0.0 model uses text + tokenization, is it trying to align with all possible pronunciations (i.e., different phone sequences with different probabilities for the same word in the dictionary) and picking the best match, or is it using some criterion to pick the most likely pronunciation first and then attempting to align with it?
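
To make the question concrete, here is a rough illustration of what I mean by multiple pronunciations: an MFA dictionary can repeat a word with different phone sequences, optionally with a probability column (the entries, transcriptions, and probabilities below are illustrative):

今日 0.9 kʲ o ː
今日 0.1 k o ɴ ɲ i tɕ i

The question is whether the aligner searches across all such entries during alignment, or pre-selects one before aligning.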

leandro-gracia-gil commented 5 months ago

I'm closing this bug, as the previous 2.0.1a model can still be used in this way. Thanks for your help.