MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License
1.33k stars, 246 forks

[BUG] G2P fails on fresh install of conda and mfa on Windows #508

Closed hypnaceae closed 2 years ago

hypnaceae commented 2 years ago

I'm working with a fresh conda and MFA installation. I've generated a list of OOVs in my corpus with mfa validate, and I'm now trying to run G2P on the output .txt file to supplement the dictionary. The language is Russian, though I got the same error when running G2P on Czech. I'm using the 2.0.0a Russian G2P model in this case.

Here's the error:

(base) PS C:\Users\admin> mfa g2p --g2p_model_path .\Desktop\russian_mfa.zip --input_path .\Desktop\oovs.txt --output_path .\Desktop\oovs_lex.txt --debug --clean
Generating pronunciations from G2P model
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 10: invalid continuation byte

I'm not sure what it's failing to read, as there's no traceback. The input oovs text file is taken directly from the output of mfa validate, and is encoded in utf-8. See below: oovs_found_russian_mfa.txt

Are there any workarounds I could try?
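
One way to narrow down an error like this is to check the raw bytes of each input file directly, since the message alone does not say which file was being read. The sketch below is not part of MFA; find_invalid_utf8 and check_file are hypothetical helper names, and it only assumes the file can be read as bytes. It reports the offset and value of the first byte that breaks UTF-8 decoding, which can then be compared with the "byte 0xed in position 10" in the message.

```python
import sys
from pathlib import Path
from typing import Optional, Tuple


def find_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding.

    Returns None when the whole buffer decodes cleanly.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte.
        return err.start, data[err.start]
    return None


def check_file(path: str) -> Optional[Tuple[int, int]]:
    """Hypothetical convenience wrapper: run the check on a file on disk."""
    return find_invalid_utf8(Path(path).read_bytes())


if __name__ == "__main__":
    # Usage: python check_utf8.py oovs.txt [more files...]
    for name in sys.argv[1:]:
        problem = check_file(name)
        if problem is None:
            print(f"{name}: valid UTF-8")
        else:
            offset, value = problem
            print(f"{name}: invalid byte 0x{value:02x} at position {offset}")
```

If every input file passes, the failing read is probably happening somewhere else, for example on a path or a cached file rather than on the word list itself.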

mmcauliffe commented 2 years ago

Can you rerun with --debug and see exactly where it's hitting that error?

hypnaceae commented 2 years ago

I'm running with --debug already. Very strange that it has no effect, is it positional?

mmcauliffe commented 2 years ago

Oh sorry, my bad, I meant --verbose, which will print the full stack trace.

hypnaceae commented 2 years ago

Thanks. Seems like --verbose also has no effect. Console output is exactly the same.

mmcauliffe commented 2 years ago

Can you try running mfa g2p .\Desktop\oovs.txt .\Desktop\russian_mfa.zip .\Desktop\oovs_lex.txt --debug --clean?

The positional arguments can't be specified with the --option style flags, so I think that's what's causing this?

hypnaceae commented 2 years ago

Sure, here's the result:

(base) PS C:\Users\admin\Desktop> mfa g2p .\russian_mfa.zip .\oovs.txt .\oovs_lex.txt --debug --clean
Generating pronunciations from G2P model
WARNING! The following graphemes were not found in the specified G2P model: - a b c d e g h i k l m n o p r s t u v x z а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ё
montreal_forced_aligner.exceptions.G2PError: Previously trained Phonetisaurus models from 1.1 and earlier are not currently supported. Please retrain your model using 2.0+

That's with this model: https://github.com/MontrealCorpusTools/mfa-models/releases/tag/g2p-russian_mfa-v2.0.0a

After that, I ran mfa download g2p russian_g2p and mfa g2p russian_g2p .\oovs.txt .\oovs_lex.txt and it actually started generating pronunciations, though they look a bit weird, for example: яшеньку jA S jE nj k u . Not sure what phoneset that is, but I need the output phoneset to be the same as that in the latest dict.

Another thing I noticed was that despite installing version 2.0.6, mfa version returns 2.0.0a21. The package filename in miniconda3/pkgs/ has 2.0.6 and it's the only version of MFA installed on my machine so I'm not sure what's going on there.

Thanks for the support so far :)

mmcauliffe commented 2 years ago

Can you try rerunning with --clean? It looks like it's still using the 1.0 phone set.

hypnaceae commented 2 years ago

No difference in output :/

mmcauliffe commented 2 years ago

Hmm, OK, so it's working fine on my local machine. Can you maybe delete the Documents/MFA folder, redownload the russian_mfa G2P model, and re-run? I suspect there are some stale files left somewhere from the original 1.0 model.

The version mismatch is also weird. Did you maybe install it from pip at some point in addition to conda? What does which mfa (Unix) or where mfa (Windows) return?
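
The same check can be done from inside the active Python environment, which shows both the executable that would run and the package version that environment actually sees. This is a minimal sketch, not an MFA feature: describe_install is a hypothetical helper, and the distribution name "Montreal_Forced_Aligner" is an assumption that may need adjusting to match how the package was installed.

```python
import shutil
from importlib import metadata
from typing import Dict, Optional


def describe_install(
    distribution: str = "Montreal_Forced_Aligner",  # assumed distribution name
    command: str = "mfa",
) -> Dict[str, Optional[str]]:
    """Report which executable is on PATH and which package version is installed.

    Either value is None when the command or the distribution cannot be found.
    """
    try:
        version = metadata.version(distribution)
    except metadata.PackageNotFoundError:
        version = None
    return {"executable": shutil.which(command), "version": version}


if __name__ == "__main__":
    info = describe_install()
    print(f"executable on PATH: {info['executable']}")
    print(f"installed version:  {info['version']}")
```

If the executable path points outside the active conda environment, or the reported version differs from the one in miniconda3/pkgs/, a second installation is likely shadowing the intended one.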

hypnaceae commented 2 years ago

Nope, same exact output. :( Mind you, this machine has never seen pre-2.0.0 MFA. I had some version (possibly 2.0.1) installed earlier but recently reinstalled miniconda, which included uninstalling old packages. In any case, I got what I needed by running on a remote Linux machine... And if it works natively on Windows for you, then it's probably user error on my part or some old files hidden somewhere on my machine. I'll close the issue then. Sincere thanks again.