MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] Tokenizer/Text Normalization for Russian Fails (--language russian) #796

Closed NataliaShmueli closed 7 months ago

NataliaShmueli commented 7 months ago

Debugging checklist

[x] Have you read the troubleshooting page (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/troubleshooting.html) and searched the documentation to ensure that your issue is not addressed there?
[x] Have you updated to latest MFA version (check https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html)? What is the output of mfa version?
[x] Have you tried rerunning the command with the --clean flag?

Describe the issue

A clear and concise description of what the bug is.

Text tokenization and normalization with --language russian crashes with the error "AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'split'".

For Reproducing your issue

Please fill out the following:

  1. Corpus structure
    • What language is the corpus in?
    • Russian
    • How many files/speakers?
    • 3000 speakers
    • Are you using lab files or TextGrid files for input?
    • .lab
  2. Dictionary
    • Are you using a dictionary from MFA? If so, which one?
    • Custom
    • If it's a custom dictionary, what is the phoneset?
    • IPA
  3. Acoustic model
    • If you're using an acoustic model, is it one downloaded through MFA? If so, which one?
    • Self-created
    • If it's a model you've trained, what data was it trained on?
    • LibriSpeech, CommonVoice, others

Log file

Please attach the log file for the run that encountered an error (by default these will be stored in ~/Documents/MFA).

LivingAudioRussian.log

Desktop (please complete the following information):

Additional context

Add any other context about the problem here.

The full output would not post, but the relevant part of the traceback is:


  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\corpus\base.py", line 743, in normalize_text
    for w in result["normalized_text"].split():

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'split'

   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/652  [ 0:00:03 < -:--:-- , ? it/s ]
 ERROR    There was an error in the run, please see the log.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\Natalia\miniconda3\envs\alignernew\Scripts\mfa-script.py", line 9, in <module>
    sys.exit(mfa_cli())
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\rich_click\rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\click\decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\command_line\align.py", line 122, in align_corpus_cli
    aligner.align()
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\alignment\pretrained.py", line 333, in align
    self.setup()
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\alignment\pretrained.py", line 207, in setup
    self.load_corpus()
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\corpus\acoustic_corpus.py", line 1103, in load_corpus
    self.normalize_text()
  File "C:\Users\Natalia\miniconda3\envs\alignernew\lib\site-packages\montreal_forced_aligner\corpus\base.py", line 743, in normalize_text
    for w in result["normalized_text"].split():
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'split'
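
A possible defensive workaround, sketched under assumptions rather than taken from MFA's code: the traceback shows that result["normalized_text"] can be a spaCy Doc instead of a str when --language russian is used, and a Doc has no split() method. The helper below accepts either type and falls back to the Doc's .text attribute, which holds the document's string form. The names words_from_normalized and StandInDoc are hypothetical; StandInDoc only exists so the snippet runs without spaCy installed. This is not the fix that was applied in MFA.

```python
from typing import Any, List


class StandInDoc:
    """Hypothetical stand-in for spacy.tokens.Doc (only the .text attribute)."""

    def __init__(self, text: str) -> None:
        self.text = text


def words_from_normalized(normalized: Any) -> List[str]:
    """Split normalized text into words, whether it is a str or a Doc-like object.

    Assumption: a Doc-like object exposes its full string as `.text`,
    as spaCy's Doc does.
    """
    if isinstance(normalized, str):
        return normalized.split()
    text = getattr(normalized, "text", None)
    if isinstance(text, str):
        return text.split()
    raise TypeError(
        f"Cannot split normalized text of type {type(normalized).__name__}"
    )


if __name__ == "__main__":
    # Both inputs yield the same word list.
    print(words_from_normalized("привет мир"))
    print(words_from_normalized(StandInDoc("привет мир")))
```

With a helper like this, the loop at base.py line 743 could iterate over words_from_normalized(result["normalized_text"]) instead of calling .split() directly, assuming the rest of normalize_text only needs the whitespace-delimited words.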