MontrealCorpusTools / mfa-models

Collection of pretrained models for the Montreal Forced Aligner

inconsistencies in Armenian dictionary #9

jhdeov opened this issue 2 years ago · Status: Open

jhdeov commented 2 years ago

Hello,

On the MFA page for Armenian, it seems the dictionary is based on Armenian transliteration rather than phonetic transcription.

If you want, I can re-transcribe your dictionary file using a mix of Wiktionary + my native judgments.

mmcauliffe commented 2 years ago

Yeah, that would be great! @echodroff and @emilyahn created that dictionary using the XPF system. I know they were updating some lexicons a while back for VoxCommunis, but it looks like Armenian hasn't been updated.

It looks like Wikipron has scraped Wiktionary for Eastern Armenian and Western Armenian, so that might be an easier starting point if the Wiktionary pronunciations are a good basis. If you go this route, I'd be happy to host that as an MFA phoneset dictionary and train a corresponding model (or host one that you train).
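
For concreteness, here's a rough sketch of what converting a Wikipron scrape into an MFA-style dictionary could look like, assuming the usual two-column Wikipron TSV layout (word, then space-separated IPA phones); the file names below are just placeholders:

```python
# Hypothetical sketch: convert a Wikipron TSV (word<TAB>space-separated IPA
# phones) into an MFA-style pronunciation dictionary (one word + its phones
# per line). File names are placeholders.
import csv

def wikipron_to_mfa_dict(wikipron_tsv: str, mfa_dict_path: str) -> None:
    with open(wikipron_tsv, encoding="utf-8") as src, \
         open(mfa_dict_path, "w", encoding="utf-8") as dst:
        for word, pron in csv.reader(src, delimiter="\t"):
            dst.write(f"{word}\t{pron}\n")

# wikipron_to_mfa_dict("hye_broad.tsv", "armenian_eastern.dict")
```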

jhdeov commented 2 years ago

The Wiktionary data is pretty reliable (I cleaned it up in 2021). However, the two dialects have radically different transcriptions for the same word: basically, any voiced plosive in Eastern is voiceless aspirated in Western, while any voiceless unaspirated plosive in Eastern is voiced in Western. So using separate models (a hye one and a hyw one) may be wise. A complication, though, is that the Vox recordings had both hye and hyw speakers pooled together. Perhaps I can ask the maker of the recordings (whom I know) if I can break up their corpus into the two dialects? I can also provide audio archives of both hye and hyw speech if the Vox corpus isn't big enough.
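
To make that correspondence concrete, here is an illustrative (not exhaustive) sketch of the stop mapping; the exact IPA symbols depend on how each dictionary transcribes the stops, so treat them as examples:

```python
# Illustrative mapping between Eastern (hye) and Western (hyw) Armenian
# plosives, per the correspondence described above: Eastern voiced stops
# surface as voiceless aspirated in Western, and Eastern voiceless
# unaspirated stops surface as voiced in Western. Symbols are examples only.
EASTERN_TO_WESTERN = {
    "b": "pʰ", "d": "tʰ", "ɡ": "kʰ",   # voiced -> voiceless aspirated
    "p": "b",  "t": "d",  "k": "ɡ",    # voiceless unaspirated -> voiced
}

def eastern_to_western(pron: list[str]) -> list[str]:
    """Map an Eastern Armenian phone sequence to its Western counterpart."""
    return [EASTERN_TO_WESTERN.get(phone, phone) for phone in pron]
```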

I'm new to MFA (still doing the tutorials). When you say "that route", do you mean just taking all the Wiktionary words as the pronunciation dictionary and then you guys re-run the models? I'm happy to help in any way, even if it's just correcting the existing pronunciation dictionary that you guys have.

mmcauliffe commented 2 years ago

Yeah, so what I did for all the MFA dictionaries/acoustic models was:

  1. Download the scraped Wikipron dictionaries per dialect
  2. Run a clean-up script with basic rules for normalizing ligatures, some narrow diacritics, tone markings, etc., so that the resulting dictionary uses a mostly standardized set of symbols across languages. The clean-up rules are defined per dialect dictionary (a rough sketch of this step follows the list).
  3. Create G2P models based on the dictionary (and fix any random typo/pronunciation errors in the dictionary using the phones symbol table)
  4. Create a speaker-dictionary mapping that assigns speakers to specific dialect dictionaries, plus a "default" dictionary, containing all pronunciations, for speakers whose dialect is unknown
  5. Run a validation script to get a list of OOVs in the corpora using the initial speaker-dictionary mapping.
  6. Run a G2P generation script to supplement the original dictionary (and re-run the speaker-dictionary mapping script to regenerate the default dictionary with the new pronunciations)
  7. Train the acoustic model using the speaker-dictionary mapping
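
As a rough sketch of the kind of per-dialect clean-up in step 2 (the substitution table and diacritic set below are illustrative placeholders, not the actual rules used for the MFA dictionaries):

```python
# Illustrative per-dialect clean-up: normalize a few symbols, strip some
# narrow diacritics, and leave everything else untouched. The substitutions
# and diacritic list are placeholders, not the real MFA clean-up rules.
import re

SUBSTITUTIONS = {"ɫ": "l", "ʦ": "t͡s"}          # example symbol normalizations
NARROW_DIACRITICS = "\u0329\u032f\u0306"        # syllabic, non-syllabic, extra-short

def clean_pron(pron: str) -> str:
    for old, new in SUBSTITUTIONS.items():
        pron = pron.replace(old, new)
    return re.sub(f"[{NARROW_DIACRITICS}]", "", pron)

def clean_dictionary(entries: dict[str, list[str]]) -> dict[str, list[str]]:
    return {word: [clean_pron(p) for p in prons] for word, prons in entries.items()}
```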

So I would say, as long as the speaker information is somewhere in the corpus, it should be possible to generate the speaker-dictionary mapping (that's what I've done with Common Voice and other corpora), and then have a fallback dictionary that contains all variants for speakers whose dialect isn't specified.
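
Something like the following sketch would work for generating that mapping; the YAML layout is an assumption (check the MFA docs for the exact multispeaker dictionary format), and the dictionary paths are placeholders:

```python
# Sketch: build a speaker -> dictionary mapping with a "default" fallback for
# speakers whose dialect is unknown. The YAML layout is an assumption; see
# the MFA documentation for the exact multispeaker dictionary format.
import yaml  # PyYAML

DICTS = {
    "hye": "armenian_eastern.dict",   # placeholder paths
    "hyw": "armenian_western.dict",
    "default": "armenian_all.dict",   # union of both dialects' pronunciations
}

def build_mapping(speaker_dialects: dict[str, str], out_path: str) -> None:
    mapping = {"default": DICTS["default"]}
    for speaker, dialect in speaker_dialects.items():
        mapping[speaker] = DICTS.get(dialect, DICTS["default"])
    with open(out_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(mapping, f, allow_unicode=True)

# build_mapping({"spk_001": "hye", "spk_002": "hyw"}, "speaker_dictionary.yaml")
```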

In terms of data, Common Voice has 2 hours of Armenian, and that's the only Armenian corpus I can find on OpenSLR and open-speech-corpora, so getting as much data as possible from other sources you know of would make for a much better acoustic model.

(I'll also try to expand this walkthrough into an actual docs page as a concrete example for end-to-end training)

echodroff commented 2 years ago

Thank you for flagging this, and yes, we were planning on updating that model. (Michael, if you're already on this, let us know!) Ultimately our goal is to have good G2P for the given audio recording for downstream phonetic analysis. If Eastern and Western Armenian are very different, and it's possible to split the Common Voice dataset into the two dialects, that would be great. It looks like participants did not report their accent, at least in Common Voice v7, but perhaps that's something the Common Voice folks could sort out post hoc or for the future.

echodroff commented 2 years ago

Also, reading your original comment more closely, it looks like we will struggle to add the schwas with XPF alone. (The affricate issue was a missing line in our Python script.) I think the Wikipron route might be preferable at this point; the bigger problem now is figuring out which dialect to use for each recording, given the lack of metadata. I wonder if we could build some dialect classifier using this information about stop voicing, or run a first pass that allows both pronunciations in the lexicon.
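
The "first pass with both pronunciations" idea could be as simple as merging the two dialect lexicons and letting the aligner choose among the variants; a rough sketch (function and file names are placeholders, not part of MFA):

```python
# Sketch of the "allow both pronunciations" first pass: merge the Eastern and
# Western lexicons so every word lists all variants, then write out a single
# combined dictionary. Names are placeholders, not part of MFA.
from collections import defaultdict

def merge_lexicons(*lexicons: dict[str, set[str]]) -> dict[str, set[str]]:
    merged: dict[str, set[str]] = defaultdict(set)
    for lexicon in lexicons:
        for word, prons in lexicon.items():
            merged[word] |= prons
    return dict(merged)

def write_dict(lexicon: dict[str, set[str]], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(lexicon):
            for pron in sorted(lexicon[word]):
                f.write(f"{word}\t{pron}\n")
```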

jhdeov commented 2 years ago

Schwa: Yeah... knowing where the schwa goes requires a mix of phonological and morphological information. It's a pain to predict...

Classifier: The maker of the Vox corpus does have some guidelines for doing dialect splits. I can provide a list of 'rules' that distinguish the dialects -- like the voicing difference above, and others.

Metadata: For Vox, because it's only 2 hours, I could potentially just listen to the recordings and provide the metadata myself on whether a given sentence is hye or hyw. I remember I provided audio recordings for it, but I don't know how the corpus is organized internally (i.e., where I can listen to each recording and add metadata). I just emailed the Vox maker about this.

PS: I emailed Michael before you first commented, offering some lists of potential audio corpora to use. I'm mostly just unsure what the minimum corpus-annotation requirements for MFA are. For example, can the corpus be orthography-less, transcription-less, etc.?