MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

German model/dictionary question #232

Closed jplu111 closed 2 years ago

jplu111 commented 3 years ago

Hello,

I have downloaded MFA and was able to get it to work on English, but I am having some trouble getting it to work on German. I have downloaded the most recent German acoustic model and dictionary, but I get the following message:

"There were phones in the dictionary that do not have acoustic models: &1, @0, W1, a1, i1, k, n, |1"

Is there a way around this?

Thank you.

mmcauliffe commented 3 years ago

Which German model did you download? If you downloaded the dictionary from https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html#available-pronunciation-dictionaries, make sure you're using the prosodylab german model (https://github.com/MontrealCorpusTools/mfa-models/raw/master/acoustic/german_prosodylab.zip). I haven't touched either of those since 1.0, so nothing should have changed there.

jplu111 commented 3 years ago

Hello,

Thank you for your reply.

I have used the dictionary and German model from these links and I still receive the error "There were phones in the dictionary that do not have acoustic models: &1, @0, W1, a1, i1, k, n, |1"

mmcauliffe commented 3 years ago

I've updated the German prosodylab dictionary (and renamed it to be more consistent with the naming scheme elsewhere): https://github.com/MontrealCorpusTools/mfa-models/blob/master/dictionary/german_prosodylab.dict. I removed some foreign words that had phones not in the acoustic model, but I didn't have an issue with the ones that you listed, and they show up properly in the german_prosodylab.zip meta file, so I'm a bit confused. You can try rerunning with the new dictionary file, but since I wasn't able to reproduce your error it still might not work.

Can you provide some more information, i.e., commands you're running, the config.yaml and align.log files from the temporary directory (by default under ~/Documents/MFA)?

jplu111 commented 3 years ago

Thanks for this, and sorry for my slow reply.

I have tried the new German dictionary and model and am still unable to run the German aligner (I can run the model on English). With the new acoustic model I have encountered two errors, depending on the name of the German model file. If I rename german_prosodylab.zip to german.zip, I receive the same error as before (I've named this Error #1). However, if I keep the acoustic model as german_prosodylab.zip, I receive a different error (I've named this Error #2). The output for each error is in the attached file MFA german_errors. I have also attached the (renamed) config.yaml and align.log files for each error.

Thank you again.

Blauesocken commented 3 years ago

I had the same problems with the German dictionary. Here is how I solved it:

In my experience, MFA's error messages are not always straightforward when it fails, so the error message you got may be misleading. In my case, I noticed that umlauts (ä, ö, ü) are written as such in the dictionary. A lot of people avoid them and write, e.g., "Foehn" instead of "Föhn" (which makes sense in some situations but not here), especially if they are not German themselves. The German dictionary and the acoustic model work with those graphemes! I changed the transcriptions in my TextGrids accordingly, and did the same for "ß" (do not use "ss" instead!).

This solved the issue in my case. It took me a while to discover this simple problem. Maybe it helps in your case, too!
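
One way to spot such spellings before running the aligner is to list the transcript words that have no dictionary entry. The following is only a minimal Python sketch under stated assumptions: the function names are made up for illustration, the word set is toy data rather than the real dictionary, and matching is done case-insensitively after Unicode NFC normalization (so that "ö" typed as "o" plus a combining diaeresis still matches). It is not part of MFA.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Compose characters (NFC) and lowercase; ä, ö, ü and ß stay distinct from ae, oe, ue, ss."""
    return unicodedata.normalize("NFC", text).lower()


def words_missing_from_dictionary(transcript: str, dictionary_words: set) -> list:
    """Return the transcript words that have no entry in the dictionary (hypothetical helper)."""
    known = {normalize(word) for word in dictionary_words}
    tokens = re.findall(r"\w+", normalize(transcript))
    return sorted({token for token in tokens if token not in known})


# Toy data for illustration only, not taken from the real dictionary file.
dictionary_words = {"DER", "FÖHN", "IM", "ALLTAG"}
print(words_missing_from_dictionary("Der Foehn im Alltag", dictionary_words))
# prints ['foehn']
```

Words reported this way are candidates for respelling with the proper graphemes; they may also simply be missing from the dictionary.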

jplu111 commented 3 years ago

Thank you very much for this! I will try this out 😊

robertfromont commented 2 years ago

I'm having a similar-sounding problem. I'm running:

mfa validate corpus german_prosodylab german_prosodylab

...and I'm getting:

INFO - Setting up corpus information...
INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 72.0
INFO - Setting up training data...
ERROR - There was an error in the run, please see the log.
PronunciationAcousticMismatchError: There were phones in the dictionary that do not have acoustic models: $1, A0, A1, V1, ^1, c0, drei0, null0, sechs1, w, zwei0, zwei1, {0, and {1

The transcripts use umlauts, so that's not the problem in this case.

I looked in the dictionary file for some of the offending phonemes:

...
ALLROUNDMAN $1 l r sechs1 n d m {0 n
ALLROUNDMANS $1 l r sechs1 n d m {0 n s
ALLROUNDMEN $1 l r sechs1 n d m E0 n
...
ALLTAG &1 l t a0 k
ALLTAGE &1 l t a0 g @0
...
PLUMPUDDING p l V1 m p U1 d I0 N
PLUMPUDDINGS p l V1 m p U1 d I0 N s
...
PARFUM p &0 r f ^1
...
LINGERIE l c0 Z @0 r i1
...
TEAMWORK t i1 m w drei0 k
TEAMWORKS t i1 m w drei0 k s
...

Maybe loanwords have been added to the dictionary with 'foreign' phonemes for which there are no models?

I'm not sure I understand the acoustic models files, but there's a meta.yaml file there that seems to list the phonemes as being:

phones: ['&0', '&1', )0, )1, +, /0, /1, '=', '@0', B0, B1, E0, E1, I0, I1, J, N, O0,
  O1, S, U0, U1, W0, W1, X0, X1, Y0, Y1, Z, _, a0, a1, b, d, e0, e1, f, g, h, i0,
  i1, j, k, l, m, n, null1, o0, o1, p, q0, q1, r, s, t, u0, u1, v, x, y0, y1, z, '|0',
  '|1', ~1]

...which doesn't include the 'borrowed' phonemes above.

Is there a straightforward way around this?
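
The comparison described above (phones used in the dictionary versus phones listed in meta.yaml) can be sketched in Python. This is an illustration under assumptions, not MFA code: the function name is hypothetical, each dictionary line is assumed to be a word followed by whitespace-separated phones (as in the entries quoted above), and the model phone set below is only a subset of the meta.yaml list.

```python
def phones_without_models(dictionary_lines, model_phones):
    """Return the dictionary phones that are absent from the acoustic model's phone list."""
    missing = set()
    for line in dictionary_lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank lines and "..." placeholders
            continue
        # fields[0] is the word; the remaining fields are its phones
        missing.update(phone for phone in fields[1:] if phone not in model_phones)
    return sorted(missing)


# Entries quoted in this thread; the phone set is a subset of the meta.yaml list.
entries = [
    "ALLROUNDMAN $1 l r sechs1 n d m {0 n",
    "ALLTAG &1 l t a0 k",
    "PARFUM p &0 r f ^1",
]
model_phones = {"&0", "&1", "a0", "d", "f", "k", "l", "m", "n", "p", "r", "t"}
print(phones_without_models(entries, model_phones))
# prints ['$1', '^1', 'sechs1', '{0']
```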

robertfromont commented 2 years ago

FYI, in case it's useful for others, here is a workaround that worked for me:

  1. Download the dictionary file:
    https://github.com/MontrealCorpusTools/mfa-models/blob/main/dictionary/german_prosodylab.dict
  2. Remove the entries with the offending pronunciations:
    grep -v "\$1\|A0\|A1\|V1\|\^1\|c0\|drei0\|null0\|sechs1\|w\|zwei0\|zwei1\|{0\|{1" german_prosodylab.dict > german_prosodylab-no-borrowings.dict
  3. Use the new dictionary file for alignment:
    mfa align corpus ./german_prosodylab-no-borrowings.dict german_prosodylab aligned
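
As an alternative to step 2, the filtering can be sketched in Python so that phones are compared as whole tokens rather than as substrings (the grep pattern above drops any line containing, for example, a lowercase "w" anywhere, which may or may not matter for this dictionary). This is a hedged sketch: the function name is hypothetical, the line format is assumed to be a word followed by whitespace-separated phones, and the phone set is a subset of the list in meta.yaml.

```python
def keep_alignable_entries(dictionary_lines, model_phones):
    """Keep only the entries whose phones all have acoustic models."""
    kept = []
    for line in dictionary_lines:
        fields = line.split()
        # fields[0] is the word; every remaining field must be a known phone
        if len(fields) >= 2 and all(phone in model_phones for phone in fields[1:]):
            kept.append(line)
    return kept


# Entries quoted in this thread; the phone set is a subset of the meta.yaml list.
entries = [
    "ALLTAG &1 l t a0 k",
    "ALLTAGE &1 l t a0 g @0",
    "PARFUM p &0 r f ^1",
    "TEAMWORK t i1 m w drei0 k",
]
model_phones = {"&0", "&1", "@0", "a0", "f", "g", "i1", "k", "l", "m", "p", "r", "t"}
for entry in keep_alignable_entries(entries, model_phones):
    print(entry)
```

With real files, the lines would be read from german_prosodylab.dict and the kept lines written to a new file, both opened with encoding="utf-8" so that umlauts and "ß" survive.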

mmcauliffe commented 2 years ago

Closing this as there are newly trained acoustic models and dictionaries that should perform better than the 1.0 ones based on the Prosodylab German dictionary: https://mfa-models.readthedocs.io/en/latest/index.html