Open NataliaShmueli opened 7 months ago
Actually, if possible, adding it to the validate feature might serve even better
To be clear, you're running into this during mfa align with the pretrained Japanese model, right? Where the pronunciation chosen for 私 is w a t a k ɯ ɕ i rather than w a t a ɕ i? This wouldn't be an issue in the tokenizer, which would likely output ワタシ as the pronunciation every time (I think). The aligner only uses Sudachipy's pronunciations when the word isn't in the dictionary, so this is more likely a case of picking the wrong pronunciation variant, independent of any tokenization. If you want to force the ワタシ pronunciation, you can modify the pronunciation dictionary and remove the ワタクシ variants (since the dictionary is overspecified for variation). I'll double check the Japanese training variation for 私 once I have all the existing models updated to 3.0.
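For anyone who wants to try the dictionary edit suggested above, here is a minimal sketch of dropping the ワタクシ variant of 私. It is not part of MFA: the function name drop_variants is hypothetical, and it assumes a tab-separated dictionary with the word in the first field and the space-separated phones in the last field (any probability columns in between are left alone). Check your own dictionary's layout before relying on it.

```python
def drop_variants(lines, word, unwanted_prons):
    """Return dictionary lines with the unwanted pronunciations of `word` removed.

    Assumed layout (not verified against any particular MFA release):
    fields are tab-separated, the word is first, the phones are last.
    """
    # Normalize whitespace so "w a t a  ɕ i" and "w a t a ɕ i" compare equal.
    unwanted = {" ".join(p.split()) for p in unwanted_prons}
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        entry_word = fields[0]
        pron = " ".join(fields[-1].split())
        if entry_word == word and pron in unwanted:
            continue  # skip the variant we do not want the aligner to pick
        kept.append(line)
    return kept


# Illustrative entries only; the 0.5 probability column is made up.
sample = [
    "私\tw a t a ɕ i\n",
    "私\t0.5\tw a t a k ɯ ɕ i\n",
    "猫\tn e k o\n",
]
filtered = drop_variants(sample, "私", ["w a t a k ɯ ɕ i"])
```

To apply it to a real file, read the dictionary with open(path, encoding="utf-8"), pass the lines through drop_variants, and write the result to a new file so the original stays intact.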
It's not an issue with the MFA model, but rather with Sudachipy.
Example: The transcription is "私もあなたもうんざりしていたのだから".
The speaker says "ワタシ モ アナタ モ ウンザリ シ テ イ タ ノ ダ カラ",
but Sudachipy is outputting "ワタクシ モ アナタ モ ウンザリ シ テ イ タ ノ ダ カラ".
It tends to be fairly accurate, so I don't know why it's doing this only for 私.
Edit:
Ah, another issue. In this instance, 妻 would be read as サイ.
Is your feature request related to a problem? Please describe.
Even with the tokenizer, the output is sometimes inaccurate: for example, the Japanese tokenizer reads 私 as ワタクシ instead of the speaker's ワタシ. Being able to output the dictionary before I run the model would help a lot, so I can double check everything for more accurate results.
Describe the solution you'd like
For higher quality results from the G2P, I'd love to be able to add language tags (e.g. --language thai) to mfa g2p so I can edit the dictionary further.
Describe alternatives you've considered
So far the results are very, very good, but I still wish to double check everything going into the model.