MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 10: invalid continuation byte #688

Closed: ninackjeong closed this issue 11 months ago

ninackjeong commented 11 months ago

Debugging checklist

[O] Have you updated to the latest MFA version?
[X] Have you tried rerunning the command with the --clean flag?

Describe the issue

When running "mfa g2p [zip file created by 'mfa train_g2p'] [my dataset] [output directory]", I got the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 10: invalid continuation byte

For reproducing your issue, please fill out the following:

  1. Corpus structure
    • What language is the corpus in? Korean
    • How many files/speakers? 2 speakers, each of whom produced many utterances (this is just for testing; I have a very large number of files to process)
    • Are you using lab files or TextGrid files for input? lab files
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one? No
    • If it's a custom dictionary, what is the phoneset?
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one? g2p
    • If it's a model you've trained, what data was it trained on? Korean sound files in pcm format

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

(aligner) Cheonkams-MacBook-Pro:scripts ninackjeong$ mfa g2p /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/korean.zip /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/ /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/korean.txt
 ERROR    There was an error in the run, please see the log.                    
Exception ignored in atexit callback: <bound method ExitHooks.history_save_handler of <montreal_forced_aligner.command_line.mfa.ExitHooks object at 0x1a8901a90>>
Traceback (most recent call last):
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/montreal_forced_aligner/command_line/mfa.py", line 99, in history_save_handler
    raise self.exception
  File "/Users/ninackjeong/miniconda3/envs/aligner/bin/mfa", line 10, in <module>
    sys.exit(mfa_cli())
             ^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/rich_click/rich_group.py", line 21, in main
    rv = super().main(*args, standalone_mode=False, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/montreal_forced_aligner/command_line/g2p.py", line 124, in g2p_cli
    g2p.setup()
  File "/Users/ninackjeong/miniconda3/envs/aligner/lib/python3.11/site-packages/montreal_forced_aligner/g2p/generator.py", line 837, in setup
    for line in f:
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 10: invalid continuation byte

Desktop (please complete the following information):

Additional context

Q1. (This may not be relevant to this issue.) Do I need to convert the PCM data to WAV format?

mmcauliffe commented 11 months ago

Can you try switching the order of the corpus directory and model? So something like this should work:

mfa g2p /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/ /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/korean.zip /Volumes/ssd/dissertation/scripts/tono-init-spon/sample-test/data/sound/SDRW2100000003_pcm/korean.txt

MFA commands take their arguments in the order [input_files] [models] [output_files]. For g2p the usage is mfa g2p [OPTIONS] INPUT_PATH G2P_MODEL_PATH OUTPUT_PATH; see https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/dictionary_generating.html
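
For anyone scripting this over many files, here is a minimal sketch of a pre-flight check that catches swapped arguments before MFA runs. With the original order, the zip file ended up in the INPUT_PATH slot, so it was presumably read as UTF-8 text, which would explain the UnicodeDecodeError. The helper name build_g2p_command is hypothetical and not part of MFA; the only assumption about MFA is the argument order quoted above and that the G2P model is a zip archive, as in this thread.

```python
import zipfile
from pathlib import Path


def build_g2p_command(input_path, g2p_model_path, output_path):
    """Build the argv list for `mfa g2p`, rejecting obviously swapped arguments.

    Hypothetical helper (not part of MFA). It relies only on the documented
    order: mfa g2p INPUT_PATH G2P_MODEL_PATH OUTPUT_PATH.
    """
    input_path = Path(input_path)
    g2p_model_path = Path(g2p_model_path)
    output_path = Path(output_path)

    # The model produced by `mfa train_g2p` in this thread is a zip archive.
    # is_zipfile returns False for directories and missing paths.
    if not zipfile.is_zipfile(g2p_model_path):
        raise ValueError(f"G2P model is not a zip archive: {g2p_model_path}")

    # A zip archive in the input slot is almost certainly a swapped argument.
    if input_path.is_file() and zipfile.is_zipfile(input_path):
        raise ValueError(f"Input path looks like a model archive: {input_path}")

    return ["mfa", "g2p", str(input_path), str(g2p_model_path), str(output_path)]
```

The returned list can be passed to subprocess.run(cmd, check=True). The check only looks at file types, so it will not catch every mistake, but it would have flagged the command in the original report.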

ninackjeong commented 11 months ago

Ah, my bad! It worked!