Opdoop closed this issue 1 year ago.
Right, the Japanese models and dictionary are still under construction since I just got access to the LaboroTV corpus last week and have been cleaning it up and expanding the dictionary to include it in training. The current Japanese MFA dictionary is available here: https://github.com/MontrealCorpusTools/mfa-models/tree/main/dictionary/japanese/mfa/v2.0.0, but it'll eventually be released like the other languages once it has pronunciation and silence probabilities.
Hi @mmcauliffe , is there any update on this? I tried checking the link you posted but I can't find the .dict file. Is there also work on a g2p/acoustic model? Thanks!
@mmcauliffe Also, are there any plans on the horizon to release not only a Japanese-to-IPA dictionary, but an acoustic model as well? I'd like to use MFA if possible, but if there are no such plans I'll need to find alternative tools to do this.
I wouldn't mind trying to train an MFA acoustic model myself, though I don't know whether Japanese will be particularly problematic, or whether there are any extra steps I should take given the lack of spaces in the language.
So I have a decent Japanese model trained, I think; I'm just doing some verification to make sure it's working as well as I'd like. So a Japanese acoustic model, dictionary, and g2p model should be on the horizon (hopefully very soon!).
The only issue that I'm still unsure of is the word segmentation, since that was done externally to MFA via nagisa, but then I modified and corrected the transcripts pretty extensively to make proper phonological words (merging things like `使っ た` back to `使った`, and other things that feel more like words to me). I'd like to retrain a nagisa model, but I'd need to generate POS tags for the training data. Still, the acoustic model and dictionary should at least be a good starting point.
@mmcauliffe Thanks for replying! I'm looking forward to trying it once it's available.
By the way, is there a way to do only g2p for some input text (no audio) with the MFA command line? For other languages I was simply loading the dict and looking for words myself (ignoring for now probabilities and such), but here I would also depend on word segmentation to do that. I'm also not sure about what would happen when finding a word not in the g2p model.
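For context, the manual lookup I mean is roughly the sketch below. It assumes a simple `word<TAB>phones` layout and skips any intermediate numeric probability columns that real MFA v2 dictionaries may carry; the sample entries and phones are purely illustrative.

```python
# Minimal sketch of loading an MFA-style pronunciation dictionary and
# looking up words by hand. Assumes each line is "word<TAB>...<TAB>phones";
# any numeric probability fields between the word and phones are skipped.

def load_dict(lines):
    lexicon = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue  # skip malformed or empty lines
        word, pron = parts[0], parts[-1]
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon

if __name__ == "__main__":
    # Illustrative entries, not taken from the actual japanese_mfa dictionary.
    sample = [
        "です\td e s u",
        "ます\tm a s u",
    ]
    lexicon = load_dict(sample)
    print(lexicon["です"])  # [['d', 'e', 's', 'u']]
```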
@mmcauliffe Hi Michael! That's great news! Do you have any ballpark ETA for the acoustic model and dictionary? I'm also very eager to try it out once it's available. Thanks!
@leandro-gracia-gil Yep, the `mfa g2p` command can take a corpus path as input and generate pronunciations for all words in there (though it would overlap with the dictionary; I'll think about a way to filter it down to just OOVs). The other route would be to run `mfa validate` on the corpus, which will generate an OOVs file in the temporary directory that you can use as input to `mfa g2p`.
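In the meantime, filtering a word list down to just the OOVs can also be done outside MFA. A minimal sketch, with hypothetical data, assuming one dictionary entry per line with the word in the first tab-separated column:

```python
# Sketch: collect the words from a corpus word list that are missing from an
# MFA-style dictionary, so that only OOVs need to be passed to `mfa g2p`.
# The entries below are hypothetical placeholders.

def find_oovs(word_list, dict_lines):
    known = {line.split("\t")[0].strip() for line in dict_lines if line.strip()}
    return sorted({w for w in word_list if w and w not in known})

if __name__ == "__main__":
    dictionary = ["です\td e s u", "ます\tm a s u"]
    words = ["です", "使って", "ます"]
    print(find_oovs(words, dictionary))  # ['使って']
```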
@MicaelLemelin I'm shooting for end of November/early December if I can get the current issue resolved (at the moment it's matching a ton of silence to the final "vowel" of です and ます, even though it should just be deleted, but once I figure that out it should be good to go, since everything else looks decent).
That's really great news. Can't wait to see it! Thank you so much! @mmcauliffe
Thanks for the great work! Is there any update about the Japanese acoustic models?
Thank you very much for the update! Looking forward to trying the model once it's ready.
Ok I fell down the rabbit hole trying to clean the laboroTV corpus, but I have some cool stuff in the pipeline for speaker classification/diarization. Current plan is to shelve the laboro cleaning, get a decent model trained and released with:
- Common Voice 12
- GlobalPhone
- JVS
- Microsoft's Japanese corpus
- TEDxJP-10K
I think I have most issues worked out with it, just trained a candidate model, so I'll see how the alignments from it look and correct any data issues still remaining in the corpora above before releasing.
Thanks for all the work you have done! Your models really help me a lot in my research.
Ok Japanese acoustic model, dictionary, and g2p model are up now, so I'll close this out.
They're looking reasonably decent from my spot checks. I'll keep working on the LaboroTV corpus to get a model trained with that in the near future, but there is a lot of noise in terms of the recordings, the segmentation, and the tokenization from nagisa that I'm basing the "word" segmentation on (words like "まそれ" and "えその" popping up, where hesitations are merged with the following word).
I'm still thinking about the best way to provide a tokenization model trained on the corrections that I've done, since it isn't straightforwardly what nagisa outputs (nagisa: `使 って` vs MFA: `使って`). To retrain nagisa, I'd need to get some POS tags going, but I may be able to scrape a reasonably good dictionary for tagging via wikipron/wiktionary. This is largely outside of MFA, but it might tie into using richer dictionaries with better morphological tagging/parsing, rather than viewing each word as morphologically opaque and completely unrelated to all other words.
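To illustrate the kind of post-processing involved, here's a sketch that re-merges over-segmented tokens onto their stems. The suffix set is a hand-picked illustration, not the actual correction rules used for the corpus:

```python
# Sketch of merging over-segmented tokenizer output back into phonological
# words, e.g. ["使", "って"] -> ["使って"]. The suffix set below is a small
# hand-picked illustration, not the real rule set used for the MFA corpus.

MERGE_SUFFIXES = {"って", "た", "て"}

def merge_tokens(tokens):
    merged = []
    for tok in tokens:
        if merged and tok in MERGE_SUFFIXES:
            merged[-1] += tok  # attach the inflectional piece to its stem
        else:
            merged.append(tok)
    return merged

if __name__ == "__main__":
    print(merge_tokens(["使", "って"]))        # ['使って']
    print(merge_tokens(["使っ", "た", "本"]))  # ['使った', '本']
```

A real solution would need POS information to avoid merging standalone words that happen to look like suffixes, which is why retraining a tagger-aware tokenizer is the longer-term plan.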
Wow thank you so much!! This is really useful for my research, as well :D
mfa version: 2.0.0rc4. Running `mfa models download dictionary japanese_mfa` on Ubuntu fails; I also tried to download the dictionary from the release page, but the page shows a 404.