MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License
1.35k stars 249 forks

Language Tag for G2P and/or Validate #785

Open NataliaShmueli opened 7 months ago

NataliaShmueli commented 7 months ago

Is your feature request related to a problem? Please describe. Even with the Tokenizer, the output is sometimes inaccurate, such as 私 being read by the Japanese tokenizer as ワタクシ instead of the speaker's ワタシ. Being able to output the dictionary before I run the model would help a lot, so I can double check everything for more accurate results.

Describe the solution you'd like For higher quality results from the G2P, I'd love to be able to add language tags (e.g. --language thai) to mfa g2p so I can edit the dictionary further.

Describe alternatives you've considered So far the results are very, very good, but I still wish to double check everything going into the model.

NataliaShmueli commented 7 months ago

Actually, if possible, adding it to the validate feature might serve even better

mmcauliffe commented 7 months ago

To be clear, you're running into this during mfa align with the pretrained Japanese model, right? Where the pronunciation chosen for 私 is w a t a k ɯ ɕ i rather than w a t a ɕ i? This wouldn't be an issue in the tokenizer, which would likely be outputting ワタシ as the pronunciation every time (I think). The aligner only uses Sudachipy's pronunciations when the word isn't in the dictionary, so this would more likely come from picking the wrong pronunciation variant, independent of any tokenization. If you want to force the ワタシ pronunciation, you can modify the pronunciation dictionary and remove the ワタクシ variants (since the dictionary is overspecified for variation). I'll double check the Japanese training variation for 私 once I have all the existing models updated to 3.0.
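For illustration, removing the unwanted variants could look something like the rough sketch below. It assumes the dictionary is a tab-separated text file with the word in the first field and the space-separated phones in the last field; `drop_variants` is a hypothetical helper, not part of MFA, and the あなた sample line is made up for the example rather than taken from the real dictionary.

```python
def drop_variants(lines, word, unwanted):
    """Return the dictionary lines minus entries for `word` whose phones are in `unwanted`."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: first field is the word, last field is the phone string.
        # Any columns in between (e.g. probabilities) are left untouched.
        if len(fields) >= 2 and fields[0] == word and fields[-1].strip() in unwanted:
            continue  # skip the variant we want to remove
        kept.append(line)
    return kept


# Illustrative sample lines, not copied from the actual pretrained dictionary.
sample = [
    "私\tw a t a k ɯ ɕ i\n",
    "私\tw a t a ɕ i\n",
    "あなた\ta n a t a\n",
]

filtered = drop_variants(sample, "私", {"w a t a k ɯ ɕ i"})
print("".join(filtered), end="")
```

In practice you would read the lines from a copy of the dictionary file, write the filtered lines back out, and point mfa align at the edited copy.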

NataliaShmueli commented 7 months ago

It's not an issue with the MFA model, but rather with Sudachipy.

Example:
The transcription is "私もあなたもうんざりしていたのだから"
The speaker says "ワタシ モ アナタ モ ウンザリ シ テ イ タ ノ ダ カラ"
But Sudachipy is outputting "ワタクシ モ アナタ モ ウンザリ シ テ イ タ ノ ダ カラ"

It tends to be fairly accurate, but I don't know why it's doing it for 私 only?

Edit:

Ah, another issue. In this instance, 妻 would be read as サイ.
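As a stopgap, the kind of manual check described here could be sketched as a small override step applied to the tokenizer's readings before they go into a dictionary. This is only an assumption about a possible workflow, not an existing MFA or Sudachipy feature: `apply_reading_overrides` is a hypothetical helper, and the (surface, reading) pairs are typed in by hand here to stand in for whatever the tokenizer returns.

```python
def apply_reading_overrides(tokens, overrides):
    """Replace the reading of any token whose surface form has a manual override."""
    return [(surface, overrides.get(surface, reading)) for surface, reading in tokens]


# Hand-written stand-in for tokenizer output: (surface form, katakana reading).
tokens = [
    ("私", "ワタクシ"),
    ("も", "モ"),
    ("あなた", "アナタ"),
    ("も", "モ"),
]

# Manual corrections based on what the speaker actually says.
overrides = {"私": "ワタシ"}

fixed = apply_reading_overrides(tokens, overrides)
readings = " ".join(reading for _, reading in fixed)
print(readings)
```

An entry for 妻 could be added to the overrides in the same way once the speaker's reading is confirmed.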