MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Validating the corpus with "mfa validate" command, but get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte" #489

Closed marianasignal closed 2 years ago

marianasignal commented 2 years ago

Debugging checklist

[ ] Have you updated to latest MFA version? Yes, the version is 2.0.5.
[ ] Have you tried rerunning the command with the --clean flag? Yes, the command is "mfa validate data/speech/wav/aishell100/BAC009 data/MFA2/pretrained_models/dictionary/mandarin_china_mfa.dict data/MFA2/pretrained_models/acoustic/mandarin_mfa.zip --clean"

Describe the issue

A clear and concise description of what the bug is.

(MFAligner) audio_test@ubuntu:/data/y00580163/PDAugment$ mfa validate data/speech/wav/aishell100/BAC009 data/MFA2/pretrained_models/dictionary/mandarin_china_mfa.dict data/MFA2/pretrained_models/acoustic/mandarin_mfa.zip --clean
Exception ignored in atexit callback: <bound method ExitHooks.history_save_handler of <montreal_forced_aligner.command_line.mfa.ExitHooks object at 0x7f00ab222380>>
Traceback (most recent call last):
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/command_line/mfa.py", line 103, in history_save_handler
    raise self.exception
  File "/home/audio_test/.conda/envs/MFAligner/bin/mfa", line 11, in <module>
    sys.exit(main())
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/command_line/mfa.py", line 1077, in main
    run_validate_corpus(args, unknown)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/command_line/validate.py", line 154, in run_validate_corpus
    validate_corpus(args, unknown)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/command_line/validate.py", line 35, in validate_corpus
    validator = PretrainedValidator(
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/validation/corpus_validator.py", line 1323, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/alignment/pretrained.py", line 65, in __init__
    super().__init__(**kw)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/validation/corpus_validator.py", line 423, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/alignment/base.py", line 68, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/corpus/acoustic_corpus.py", line 1020, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/corpus/acoustic_corpus.py", line 93, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/corpus/base.py", line 98, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/abc.py", line 465, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/abc.py", line 287, in __init__
    super().__init__(**kwargs)
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/dictionary/multispeaker.py", line 148, in __init__
    self.dictionary_model = DictionaryModel(
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/site-packages/montreal_forced_aligner/models.py", line 951, in __init__
    for line in f:
  File "/home/audio_test/.conda/envs/MFAligner/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
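
For anyone hitting the same traceback: the last frames show the error is raised while the dictionary file is being read line by line as UTF-8, so it can help to check the dictionary file itself before rerunning MFA. The byte 0xa9 (binary 10101001) is a UTF-8 continuation byte, so it cannot start a character, which is what "invalid start byte" means. Below is a minimal sketch using only the Python standard library. The function name `first_undecodable_line` is hypothetical and not part of MFA; it only reports where decoding fails and does not say why the file is in that state.

```python
from typing import Optional, Tuple


def first_undecodable_line(
    path: str, encoding: str = "utf-8"
) -> Optional[Tuple[int, int, str]]:
    """Find the first line of a file that cannot be decoded.

    Returns (line_number, byte_offset_within_line, reason), or None if
    every line decodes cleanly. This is a hypothetical helper for
    illustration, not an MFA function.
    """
    # Read raw bytes so that opening the file never raises a decode error.
    with open(path, "rb") as f:
        # Splitting on b"\n" is safe for UTF-8: the byte 0x0A never occurs
        # inside a multi-byte sequence.
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start is the offset of the offending byte in this line.
                return line_number, err.start, err.reason
    return None
```

Calling it on the dictionary path from the command above (data/MFA2/pretrained_models/dictionary/mandarin_china_mfa.dict) should return either None or the line and offset of the first bad byte, which can then be inspected in a hex viewer.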

For Reproducing your issue

Please fill out the following:

  1. Corpus structure
    • What language is the corpus in? Chinese pinyin
    • How many files/speakers? The AISHELL dataset has 340 speakers.
    • Are you using lab files or TextGrid files for input? I use lab files.
  2. Dictionary
  3. Acoustic model

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

I only have command_history.yaml: command_history.zip

marianasignal commented 2 years ago

I changed the dictionary from "mandarin_china_mfa v2.0.0" to "mandarin_china_mfa v2.0.0a", which solved the problem. Thanks.