MontrealCorpusTools / mfa-models

Collection of pretrained models for the Montreal Forced Aligner
Creative Commons Attribution 4.0 International

Request for more phonemic-style UK English dictionary and acoustic model #32

Open praat-enthusiast opened 4 months ago

praat-enthusiast commented 4 months ago

I recently posted in the discussion section (see below) about the phone set used by MFA, asking whether anyone had developed a tool for returning to a more standard phonemic transcription after alignment. I'm still interested in this, as it would be very useful for our project, where variants will be differentiated using auditory and acoustic methods at a later stage in the research. Having now experimented a little more with MFA and become more familiar with the dictionary (english_uk_mfa.dict) and the phonological rules for the model I was using (english_mfa), I think that while much of what I'd like can be changed with a script after alignment, some of the issues are embedded in the dictionary and acoustic model.

Specifically, I've found that the rules for TH-alveolarization (allowing /θ/ -> /s/ and /ð/ -> /z/) and ING-variation are overgeneralized, which makes it very difficult to group variants of these phones together in order to study them at a later stage of research. As an example, a speaker in a test file that I've aligned is often transcribed by the aligner as using TH-alveolarization (e.g. 'there' as [zɛː], 'third' as [sɜː]). Auditorily, I think this speaker is actually producing dental fricatives, but because the aligner labels them as alveolar, anyone who later wanted to use the corpus to study TH-alveolarization, TH-stopping, or TH-fronting in our data would find it very difficult to locate all the cases where these might occur. Similarly, we might consider ING-variation later. In our data, the auditorily identified variants include not only [ɪn] and [ɪŋ] but also [ɪŋk], yet the aligner currently transcribes most of these as [ɪn], making it harder to find potentially variable cases later in the research.
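A minimal sketch of one possible post-alignment workaround: flag tokens by orthography rather than by the phone labels the aligner chose, so that potential TH and ING sites can still be grouped later. The `(word, phones)` input format, the function name `flag_variables`, and the spelling patterns are all assumptions made for illustration; they are not part of MFA.

```python
import re

# Hypothetical spelling-based patterns; orthography is only a rough proxy.
# "th" also matches words like "Thomas" or "lighthouse", and "ing$" also
# matches monomorphemic words like "thing", so results need manual checking.
TH_WORD = re.compile(r"th", re.IGNORECASE)
ING_WORD = re.compile(r"ing$", re.IGNORECASE)


def flag_variables(tokens):
    """Return (word, phones, tags) for tokens that may contain a TH or ING site.

    `tokens` is an iterable of (word, phones) pairs, where `phones` is the
    list of phone labels the aligner assigned to that word (assumed format).
    """
    flagged = []
    for word, phones in tokens:
        tags = []
        if TH_WORD.search(word):
            tags.append("TH")
        if ING_WORD.search(word):
            tags.append("ING")
        if tags:
            flagged.append((word, phones, tags))
    return flagged


# Example tokens; the phone labels here are illustrative only.
example = [
    ("there", ["z", "ɛː"]),
    ("walking", ["w", "ɔː", "k", "ɪ", "n"]),
    ("cat", ["k", "a", "t"]),
]
candidates = flag_variables(example)
```

This keeps every orthographic candidate regardless of which pronunciation variant the aligner selected, but it does not fix the alignment itself: the boundaries still reflect the variant the aligner picked.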

It seems likely to me that a previous MFA version included a model for UK English trained with a dictionary that used a less opinionated phone set, included fewer pronunciation variants, and did not use phonological rules that are difficult to reverse. If this is the case, I was wondering if this is something you'd be willing to share. I understand the motivation for the current, more opinionated phone set and phonological rules, but when applying existing dictionaries and acoustic models to new data and non-standard varieties, I think a lot of people working in sociolinguistics/sociophonetics would find it very useful to have access to a dictionary and acoustic model that produce a more phonemic-style output. This is available for American English in the form of the ARPA dictionary, but not for other varieties.

Thanks in advance for your time and help with this.

Discussed in https://github.com/MontrealCorpusTools/mfa-models/discussions/29

Originally posted by **praat-enthusiast** February 28, 2024

I'm aware that recent versions of MFA IPA dictionaries follow the opinionated phone set laid out [here](https://mfa-models.readthedocs.io/en/latest/mfa_phone_set.html), which produces a more allophonic transcription. However, I'm based in sociolinguistics, and for the project I'm currently working on we would be quite interested in ending up with a more phonemic or broad phonetic transcription (essentially a version with all of the rules described [here](https://mfa-models.readthedocs.io/en/latest/mfa_phone_set.html) reversed).

I was wondering if anyone has a version of the IPA dictionary for UK English which doesn't have the rules described implemented, or has already created a script of some kind to get back to a more standard phonemic transcription after alignment? As I understand it, the current acoustic models have been trained with the dictionaries that use the opinionated phone set, and the allophonic detail in the models and dictionaries improves the alignment, so we would likely be aligning using this phone set and then trying to revert to a phonemic transcription afterwards.

If anyone has attempted this, or has access to a previous version of the dictionary which doesn't implement the new phone set, I'd really appreciate it if you'd be willing to share it with me! I believe that a more phonemic-like dictionary once existed for the US English IPA dictionary at least, as it appears to be mentioned [here](https://memcauliffe.com/bootstrapping-an-ipa-dictionary-for-english-using-montreal-forced-aligner-20.html). If no one has attempted this already, I'm planning to write a script that will reverse-engineer the rules and produce a more phonemic transcription - I'll share it here if this attempt is successful!
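A minimal sketch of what the simplest part of such a reversal script might look like: a context-free relabelling of allophonic phone labels to broader phonemic ones. The mapping entries below are illustrative assumptions and would need checking against the phone set documentation linked above; `ALLOPHONE_TO_PHONEME` and `to_phonemic` are hypothetical names.

```python
# Illustrative, incomplete mapping (an assumption, not the official rule set).
# Only one-to-one cases are listed, where the allophonic label has a single
# plausible phonemic source.
ALLOPHONE_TO_PHONEME = {
    "pʰ": "p",
    "tʰ": "t",
    "kʰ": "k",
    "ɫ": "l",
}


def to_phonemic(phones, mapping=ALLOPHONE_TO_PHONEME):
    """Relabel each phone via `mapping`, leaving unknown labels unchanged."""
    return [mapping.get(phone, phone) for phone in phones]


# Example: an aligned phone sequence with aspiration and a dark l.
broad = to_phonemic(["tʰ", "ɪ", "ɫ"])
```

A lookup like this can only undo rules where the allophonic label is unambiguous. It cannot recover cases where a rule maps one phoneme onto another existing phoneme's label (for example, an alveolarized /θ/ labelled as [s]), since the label alone no longer identifies the source phoneme; those cases would need the word identity or the dictionary entry as well.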