MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License
1.31k stars, 243 forks

Regex Error on Running MFA Align #784

Closed: shreeshailgan closed this issue 6 months ago

shreeshailgan commented 6 months ago

I'm running mfa align on a dataset and get the following error message:

INFO     Initializing multiprocessing jobs...
ERROR    There was an error in the run, please see the log.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/miniconda3/envs/mfa301/bin/mfa", line 10, in <module>
    sys.exit(mfa_cli())
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/command_line/align.py", line 122, in align_corpus_cli
    aligner.align()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/alignment/pretrained.py", line 333, in align
    self.setup()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/alignment/pretrained.py", line 207, in setup
    self.load_corpus()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/corpus/acoustic_corpus.py", line 1104, in load_corpus
    self.normalize_text()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/corpus/base.py", line 655, in normalize_text
    args = self.normalize_text_arguments()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/corpus/base.py", line 621, in normalize_text_arguments
    tokenizers = getattr(self, "tokenizers", None)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/dictionary/multispeaker.py", line 135, in tokenizers
    self._tokenizers[d.id] = SimpleTokenizer(
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/tokenization/simple.py", line 356, in __init__
    self._compile_regexes()
  File "/root/miniconda3/envs/mfa301/lib/python3.9/site-packages/montreal_forced_aligner/tokenization/simple.py", line 458, in _compile_regexes
    self.final_clitic_regex = re.compile(rf"(?<=\w)({'|'.join(final_clitics)})$")
  File "/root/miniconda3/envs/mfa301/lib/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/root/miniconda3/envs/mfa301/lib/python3.9/sre_parse.py", line 969, in parse
    raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 8071

Is there an issue with my lexicon file that I need to fix?

MFA version: montreal-forced-aligner 3.0.1 pyhd8ed1ab_0 conda-forge
Command with args: mfa align <wav-dir> <lexicon-path> <acoustic-model-path> <out-dir> --clean --num_jobs 32 --single_speaker
Dataset: libri

shreeshailgan commented 6 months ago

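It was an issue with the lexicon file. It contained the token '{in_troubled_tones_), and the unmatched closing parenthesis in that token apparently ended up in the regex that MFA builds from the lexicon, which is consistent with the "unbalanced parenthesis" error in the traceback. Alignment worked fine after I removed this one token from the lexicon.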
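
For anyone hitting the same error, here is a minimal sketch of how a single token can break the pattern shown in the traceback, and how to scan a lexicon for similar tokens. It is not MFA code: build_final_clitic_pattern, find_unsafe_tokens, and tokens_from_lexicon_lines are hypothetical helpers, the pattern shape is copied from the re.compile line in the traceback, and the sketch assumes that the token was joined into that pattern unescaped and that each lexicon line starts with the word followed by whitespace.

```python
import re


def build_final_clitic_pattern(clitics, escape=False):
    """Build a pattern shaped like the one in the traceback (hypothetical helper).

    With escape=False the tokens are joined verbatim, which is the assumed
    failure mode. With escape=True each token is passed through re.escape.
    """
    parts = [re.escape(c) for c in clitics] if escape else list(clitics)
    return rf"(?<=\w)({'|'.join(parts)})$"


def find_unsafe_tokens(tokens):
    """Return tokens that fail to compile when used as a raw regex alternative.

    This only catches tokens that break compilation (for example an unmatched
    parenthesis); it does not flag tokens that compile but change the meaning.
    """
    unsafe = []
    for token in tokens:
        try:
            re.compile(f"({token})")
        except re.error:
            unsafe.append(token)
    return unsafe


def tokens_from_lexicon_lines(lines):
    """Yield the first whitespace-separated field of each non-empty line.

    Assumes a lexicon layout of 'word<whitespace>pronunciation...'.
    """
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0]


if __name__ == "__main__":
    lexicon_lines = [
        "hello\thh ah l ow",
        "'s\ts",
        "'{in_troubled_tones_)\tspn",  # the token reported in this issue
    ]
    tokens = list(tokens_from_lexicon_lines(lexicon_lines))
    print("Tokens that break a raw regex:", find_unsafe_tokens(tokens))

    try:
        re.compile(build_final_clitic_pattern(tokens))
    except re.error as exc:
        print("Unescaped join fails:", exc)

    # Escaping every token lets the same pattern compile.
    re.compile(build_final_clitic_pattern(tokens, escape=True))
    print("Escaped join compiles.")
```

Running a check like find_unsafe_tokens over the first column of the lexicon before calling mfa align should point at the offending entries, which can then be removed or corrected.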