CTC forced alignment error

tophee commented 4 months ago

I am trying to process a file in Swedish.

I'm using this command:

python diarize.py -a 12-300s-tiny.wav --whisper-model large-v3 --language sv --suppress_numerals --device cpu --no-stem

It runs OK for quite a while, but when it reaches the alignment step, it suddenly stops with a cryptic error (pasted below with some context).

This is on a MacBook Pro M1, in case it matters.

Any hints that might help me understand (and possibly fix) the error are appreciated.

Suppressing numeral and symbol tokens
Some weights of the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/Users/xhxxch/whisper-dia/diarize.py", line 155, in <module>
    spans = get_spans(tokens_starred, segments, alignment_tokenizer.decode(blank_id))
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/ctc_forced_aligner/alignment_utils.py", line 63, in get_spans
    assert seg.label == ltr, f"{seg.label} != {ltr}"
AssertionError: a != <star>

MahmoudAshraf97 commented 4 months ago

Can you upload the audio file to reproduce?

tophee commented 4 months ago

Unfortunately not this one. I can try to find one that I can share.

Are you suggesting the error is related to this specific audio file?

MahmoudAshraf97 commented 4 months ago

Yes, the error is raised in the alignment script, and the alignment depends entirely on the generated transcription.
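
To make the dependency on the transcription concrete, here is a minimal sketch of the kind of lockstep check the traceback points at. It is a simplified stand-in, not the actual `get_spans` code from ctc_forced_aligner; the `Segment` class, the `check_labels` helper, and the `<blank>` label are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical container for one aligned label and its frame range.
    label: str
    start: int
    end: int


def check_labels(tokens, segments, blank="<blank>"):
    """Walk aligned segments and expected characters in lockstep.

    `tokens` is a list of space-separated character strings (for example
    "h e j") or single markers such as "<star>". Blank segments are skipped.
    Returns the number of matched labels, or raises AssertionError on the
    first mismatch.
    """
    expected = [piece for tok in tokens for piece in tok.split(" ")]
    idx = 0
    for seg in segments:
        if seg.label == blank:
            continue
        assert idx < len(expected), "more segments than expected labels"
        assert seg.label == expected[idx], f"{seg.label} != {expected[idx]}"
        idx += 1
    return idx


if __name__ == "__main__":
    tokens = ["<star>", "h e j"]
    good = [
        Segment("<star>", 0, 1),
        Segment("<blank>", 1, 2),
        Segment("h", 2, 3),
        Segment("e", 3, 4),
        Segment("j", 4, 5),
    ]
    print(check_labels(tokens, good))
    # A segment sequence that skips the expected "<star>" fails the check.
    try:
        check_labels(["<star>", "a"], [Segment("a", 0, 1)])
    except AssertionError as exc:
        print("AssertionError:", exc)
```

In this toy version, the check fails as soon as the sequence of aligned labels and the sequence derived from the transcription disagree, which is why the same code can pass for one transcription and fail for another.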

tophee commented 4 months ago

OK, I'm checking with another file, to start with. And I noticed that it says:

[NeMo W 2024-05-22 20:23:04 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.

I'm not doing translation, so I assume this is not a problem, right?

MahmoudAshraf97 commented 4 months ago

Not a problem

tophee commented 4 months ago

I'm confused. I tried the above command on a different file twice and got two different errors, each different from the one reported above.

The first run ended with

Suppressing numeral and symbol tokens
Traceback (most recent call last):
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1472, in _get_module
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers.models.wav2vec2_bert.configuration_wav2vec2_bert'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xhxxch/whisper-dia/diarize.py", line 124, in <module>
    alignment_model, alignment_tokenizer, alignment_dictionary = load_alignment_model(
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/ctc_forced_aligner/alignment_utils.py", line 276, in load_alignment_model
    AutoModelForCTC.from_pretrained(
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 540, in from_pretrained
    if kwargs_orig.get("quantization_config", None) is not None:
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 751, in keys
    return getattribute_from_module(self._modules[module_name], attr)
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 752, in <listcomp>
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 748, in _load_attr_from_module
    module_name = model_type_to_module_name(model_type)
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 692, in getattribute_from_module
    return None
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1462, in __getattr__
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1474, in _get_module
RuntimeError: Failed to import transformers.models.wav2vec2_bert.configuration_wav2vec2_bert because of the following error (look up to see its traceback):
No module named 'transformers.models.wav2vec2_bert.configuration_wav2vec2_bert'

While the above process was executing I also did pip install 'nemo_toolkit[nlp]'. Assuming that this may be the reason why I'm getting a different error, I did pip uninstall 'nemo_toolkit[nlp]' and just to make sure that I still have what I need I did pip install 'nemo_toolkit[asr]' again.

After that the very same command failed immediately with

objc[6072]: Class AVFFrameReceiver is implemented in both /opt/anaconda3/envs/pretzel/lib/libavdevice.58.8.100.dylib (0x1759f0798) and /opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x17860c760). One of the two will be used. Which one is undefined.
objc[6072]: Class AVFAudioReceiver is implemented in both /opt/anaconda3/envs/pretzel/lib/libavdevice.58.8.100.dylib (0x1759f07e8) and /opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/av/.dylibs/libavdevice.60.1.100.dylib (0x17860c7b0). One of the two will be used. Which one is undefined.
Traceback (most recent call last):
  File "/Users/xhxxch/whisper-dia/diarize.py", line 3, in <module>
    from helpers import (
  File "/Users/xhxxch/whisper-dia/helpers.py", line 7, in <module>
    from whisperx.alignment import DEFAULT_ALIGN_MODELS_HF, DEFAULT_ALIGN_MODELS_TORCH
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/whisperx/__init__.py", line 1, in <module>
    from .transcribe import load_model
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/whisperx/transcribe.py", line 10, in <module>
    from .asr import load_model
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/whisperx/asr.py", line 13, in <module>
    from .vad import load_vad_model, merge_chunks
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/whisperx/vad.py", line 11, in <module>
    from pyannote.audio.pipelines import VoiceActivityDetection
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/pyannote/audio/pipelines/__init__.py", line 26, in <module>
    from .speaker_diarization import SpeakerDiarization
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_diarization.py", line 42, in <module>
    from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/pyannote/audio/pipelines/speaker_verification.py", line 56, in <module>
    from nemo.collections.asr.models import (
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/asr/__init__.py", line 15, in <module>
    from nemo.collections.asr import data, losses, models, modules
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/asr/models/__init__.py", line 36, in <module>
    from nemo.collections.asr.models.transformer_bpe_models import EncDecTransfModelBPE
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/asr/models/transformer_bpe_models.py", line 52, in <module>
    from nemo.collections.nlp.modules.common import TokenClassifier
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/nlp/__init__.py", line 15, in <module>
    from nemo.collections.nlp import data, losses, models, modules
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/nlp/models/__init__.py", line 31, in <module>
    from nemo.collections.nlp.models.machine_translation import MTEncDecModel
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/nlp/models/machine_translation/__init__.py", line 15, in <module>
    from nemo.collections.nlp.models.machine_translation.mt_enc_dec_bottleneck_model import MTBottleneckModel
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/nlp/models/machine_translation/mt_enc_dec_bottleneck_model.py", line 23, in <module>
    from nemo.collections.nlp.models.machine_translation.mt_enc_dec_model import MTEncDecModel
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 38, in <module>
    from nemo.collections.common.tokenizers.chinese_tokenizers import ChineseProcessor
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/nemo/collections/common/tokenizers/chinese_tokenizers.py", line 38, in <module>
    import opencc
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/opencc.py", line 24, in <module>
    libopencc = CDLL('libopencc.so.1', use_errno=True)
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(libopencc.so.1, 0x0006): tried: 'libopencc.so.1' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibopencc.so.1' (no such file), '/opt/anaconda3/envs/pretzel/lib/python3.10/lib-dynload/../../libopencc.so.1' (no such file), '/opt/anaconda3/envs/pretzel/bin/../lib/libopencc.so.1' (no such file), '/usr/lib/libopencc.so.1' (no such file, not in dyld cache), 'libopencc.so.1' (no such file), '/usr/local/lib/libopencc.so.1' (no such file), '/usr/lib/libopencc.so.1' (no such file, not in dyld cache)

Edit: I reinstalled the requirements (except for nemo, which fails via the requirements.txt), but the error remains the same, no matter what audio file I use.

MahmoudAshraf97 commented 4 months ago

Please reinstall ctc-forced-aligner; it needs to be recompiled against the torch version you are using. Also upgrade transformers to the latest version, or at least 4.34.
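
A quick way to see which versions are actually installed in the active environment is sketched below, using only the standard library. The helper names are made up for illustration, the distribution name `ctc-forced-aligner` is assumed from the comment above, and the 4.34 threshold is simply the one mentioned there.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version, parts=2):
    """Turn '4.34.1' into (4, 34); non-digit characters in a part are ignored."""
    out = []
    for piece in version.split(".")[:parts]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        out.append(int(digits) if digits else 0)
    return tuple(out)


def meets_minimum(version, minimum):
    """True if `version` is present and its major.minor is >= `minimum`."""
    if version is None:
        return False
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    for name in ("transformers", "torch", "ctc-forced-aligner"):
        print(name, installed_version(name))
    ok = meets_minimum(installed_version("transformers"), "4.34")
    print("transformers >= 4.34:", ok)
```

Running this inside the same conda environment as `diarize.py` shows whether the reinstall actually changed anything.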

MahmoudAshraf97 commented 4 months ago

or it's better to reinstall all the requirements

tophee commented 4 months ago

or it's better to reinstall all the requirements

I did, but that didn't change anything.

What seems to work (still executing, so far) is the solution mentioned in https://github.com/MahmoudAshraf97/whisper-diarization/issues/177#issuecomment-2097047524. I did

brew install opencc
ln -s /opt/homebrew/lib/libopencc.dylib libopencc.so.1
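
Note that `ln -s` without a target directory creates the link in the current working directory, and the dlopen error above lists the bare name 'libopencc.so.1' among the paths it tried. A small sketch for checking whether the loader can now find the library under that name, run from the same directory as `diarize.py` (the `can_load` helper is hypothetical; the library name is the one from the traceback):

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader can open `lib_name`, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the library cannot be found or opened.
        return False


if __name__ == "__main__":
    print("libopencc.so.1 loadable:", can_load("libopencc.so.1"))
```

If this prints False from one directory and True from another, the symlink is probably only being found relative to the working directory.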

Now I'm waiting for the command to finish processing after Suppressing numeral and symbol tokens

What puzzles me, though, is why I previously (with the first test file above) didn't get an error about libopencc.so.1 and now suddenly did.

Edit: OK, we're back to where we were in the OP:

Suppressing numeral and symbol tokens
Some weights of the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at MahmoudAshraf/mms-300m-1130-forced-aligner and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/Users/xhxxch/whisper-dia/diarize.py", line 155, in <module>
    spans = get_spans(tokens_starred, segments, alignment_tokenizer.decode(blank_id))
  File "/opt/anaconda3/envs/pretzel/lib/python3.10/site-packages/ctc_forced_aligner/alignment_utils.py", line 63, in get_spans
    assert seg.label == ltr, f"{seg.label} != {ltr}"
AssertionError: g != <star>

But this is with a different audio file, so the error is not specific to one file. I suspect it's not so much about the audio file as about the language. You can probably take any audio file in Swedish and reproduce the error.

Maybe this is related: As I am trying to understand how your script works, it looks like it is using a wav2vec2 model, just like whisperX which made me wonder how it works with Swedish audio, given that Swedish is not one of the languages for which whisperX already has a wav2vec2 model (when I tried whisperX I used KBLab/wav2vec2-large-voxrex-swedish).

MahmoudAshraf97 commented 4 months ago

@tophee My script uses a multilingual alignment model. If you change the default model to one that has the native vocabulary of the language, you need to turn romanization off too. Can you upload the audio file so I can test? I tried a Swedish audio file and it worked fine with the default model.
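
As a rough illustration of why romanization and the model vocabulary have to match, the sketch below strips diacritics as a crude stand-in for romanization and then checks which characters fall outside a vocabulary. This is not the romanizer the project uses, and both vocabularies are made up for the example.

```python
import unicodedata


def strip_diacritics(text):
    """Crude stand-in for romanization: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def out_of_vocab(text, vocab):
    """Return the sorted characters of `text` that are missing from `vocab`."""
    return sorted({ch for ch in text if ch not in vocab and not ch.isspace()})


if __name__ == "__main__":
    # Hypothetical vocabularies: plain Latin versus Latin plus Swedish letters.
    latin_vocab = set("abcdefghijklmnopqrstuvwxyz")
    native_vocab = latin_vocab | {"\u00e5", "\u00e4", "\u00f6"}

    text = "sm\u00f6rg\u00e5s"  # "smörgås"
    print(out_of_vocab(text, latin_vocab))                    # native text, Latin-only vocab
    print(out_of_vocab(strip_diacritics(text), latin_vocab))  # romanized text, Latin-only vocab
    print(out_of_vocab(text, native_vocab))                   # native text, native vocab
```

The point of the toy is only that text and vocabulary must be in the same form: romanized text for a model with a Latin-only vocabulary, native text for a model trained on the language's own characters.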

MahmoudAshraf97 / whisper-diarization

CTC forced alignment error #190