Opdoop closed this issue 1 year ago.
Right, the Japanese models and dictionary are still under construction since I just got access to the LaboroTV corpus last week and have been cleaning it up and expanding the dictionary to include it in training. The current Japanese MFA dictionary is available here: https://github.com/MontrealCorpusTools/mfa-models/tree/main/dictionary/japanese/mfa/v2.0.0, but it'll eventually be released like the other languages once it has pronunciation and silence probabilities.
Hi @mmcauliffe , is there any update on this? I tried checking the link you posted but I can't find the .dict file. Is there also work on a g2p/acoustic model? Thanks!
@mmcauliffe Also, are there any plans on the horizon to release not only a Japanese-to-IPA dictionary, but an acoustic model as well? I'd like to use MFA if possible, but if there are no such plans I'll need to find alternative tools to do this.
I wouldn't mind trying to train an MFA acoustic model myself, though I don't know whether Japanese will be particularly problematic, or whether there are any extra steps I should take given the lack of spaces in the language.
So I have a decent Japanese model trained, I think; I'm just doing some verification to make sure it's working as well as I'd like. So a Japanese acoustic model, dictionary, and g2p model should be on the horizon (hopefully very soon!).
The only issue that I'm still unsure of is the word segmentation, since that was done externally to MFA via nagisa, but then I modified and corrected the transcripts pretty extensively to make proper phonological words (merging things like `使っ た` back to `使った`, and other things that feel more like words to me). I'd like to retrain a nagisa model, but I'd need to generate POS tags for the training data. Still, the acoustic model and dictionary should at least be a good starting point.
@mmcauliffe Thanks for replying! I'm looking forward to trying it once it's available.
By the way, is there a way to do only g2p for some input text (no audio) with the MFA command line? For other languages I was simply loading the dict and looking for words myself (ignoring for now probabilities and such), but here I would also depend on word segmentation to do that. I'm also not sure about what would happen when finding a word not in the g2p model.
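For context, the manual lookup I mean is roughly the sketch below. It assumes a simple `word<TAB>phones` layout and skips any intermediate numeric probability columns that real MFA v2 dictionaries may carry; the sample entries and phones are purely illustrative.

```python
# Minimal sketch of loading an MFA-style pronunciation dictionary and
# looking up words by hand. Assumes each line is "word<TAB>...<TAB>phones";
# any numeric probability fields between the word and phones are skipped.

def load_dict(lines):
    lexicon = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue  # skip malformed or empty lines
        word, pron = parts[0], parts[-1]
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon

if __name__ == "__main__":
    # Illustrative entries, not taken from the actual japanese_mfa dictionary.
    sample = [
        "です\td e s u",
        "ます\tm a s u",
    ]
    lexicon = load_dict(sample)
    print(lexicon["です"])  # [['d', 'e', 's', 'u']]
```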
@mmcauliffe Hi Michael! That's great news! Do you have any ballpark ETA for the acoustic model and dictionary? I'm also very eager to try it out once it's available. Thanks!
@leandro-gracia-gil Yep, the `mfa g2p` command can take a corpus path as input and generate pronunciations for all words in there (though it would overlap with the dictionary; I'll think about a way to filter it down to just OOVs). The other route would be to run `mfa validate` on the corpus, which will generate an OOVs file in the temporary directory that you can use as input to `mfa g2p`.
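In the meantime, filtering a word list down to just the OOVs can also be done outside MFA. A minimal sketch, with hypothetical data, assuming one dictionary entry per line with the word in the first tab-separated column:

```python
# Sketch: collect the words from a corpus word list that are missing from an
# MFA-style dictionary, so that only OOVs need to be passed to `mfa g2p`.
# The entries below are hypothetical placeholders.

def find_oovs(word_list, dict_lines):
    known = {line.split("\t")[0].strip() for line in dict_lines if line.strip()}
    return sorted({w for w in word_list if w and w not in known})

if __name__ == "__main__":
    dictionary = ["です\td e s u", "ます\tm a s u"]
    words = ["です", "使って", "ます"]
    print(find_oovs(words, dictionary))  # ['使って']
```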
@MicaelLemelin I'm shooting for end of November/early December if I can get the current issue resolved (at the moment it's matching a ton of silence to the final "vowel" of です and ます, even though it should just be deleted, but once I figure that out it should be good to go, since everything else looks decent).
That's really great news. Can't wait to see it! Thank you so much! @mmcauliffe
Thanks for the great work! Is there any update about the Japanese acoustic models?
Thank you very much for the update! Looking forward to trying the model once it's ready.
Ok I fell down the rabbit hole trying to clean the laboroTV corpus, but I have some cool stuff in the pipeline for speaker classification/diarization. Current plan is to shelve the laboro cleaning, get a decent model trained and released with:
- Common Voice 12
- GlobalPhone
- JVS
- Microsoft's Japanese corpus
- TEDxJP-10K
I think I have most issues worked out with it, just trained a candidate model, so I'll see how the alignments from it look and correct any data issues still remaining in the corpora above before releasing.
Thanks for all the work you have done! Your models really help me a lot in my research.
Ok Japanese acoustic model, dictionary, and g2p model are up now, so I'll close this out.
They're looking reasonably decent from my spot checks. I'll keep working on the LaboroTV corpus to get a model trained with that in the near future, but there is a lot of noise in terms of the recordings, the segmentation, and the tokenization from nagisa that I'm basing the "word" segmentation on (words like "まそれ" and "えその" popping up, where hesitations are merged with the following word).
I'm still thinking about the best way to provide a tokenization model trained on the corrections that I've done, since it isn't straightforwardly what nagisa outputs (nagisa: `使 って` vs MFA: `使って`). To retrain nagisa, I'd need to get some POS tags going, but I may be able to scrape a reasonably good dictionary for tagging via wikipron/wiktionary. This is largely outside of MFA, but it might tie into using richer dictionaries with better morphological tagging/parsing, rather than viewing each word as morphologically opaque and completely unrelated to all other words.
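To illustrate the kind of post-processing involved, here's a sketch that re-merges over-segmented tokens onto their stems. The suffix set is a hand-picked illustration, not the actual correction rules used for the corpus:

```python
# Sketch of merging over-segmented tokenizer output back into phonological
# words, e.g. ["使", "って"] -> ["使って"]. The suffix set below is a small
# hand-picked illustration, not the real rule set used for the MFA corpus.

MERGE_SUFFIXES = {"って", "た", "て"}

def merge_tokens(tokens):
    merged = []
    for tok in tokens:
        if merged and tok in MERGE_SUFFIXES:
            merged[-1] += tok  # attach the inflectional piece to its stem
        else:
            merged.append(tok)
    return merged

if __name__ == "__main__":
    print(merge_tokens(["使", "って"]))        # ['使って']
    print(merge_tokens(["使っ", "た", "本"]))  # ['使った', '本']
```

A real solution would need POS information to avoid merging standalone words that happen to look like suffixes, which is why retraining a tagger-aware tokenizer is the longer-term plan.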
Wow thank you so much!! This is really useful for my research, as well :D
mfa version: 2.0.0rc4. Running `mfa models download dictionary japanese_mfa` on Ubuntu fails; I also tried to download the dictionary from the release page, but the page shows a 404.