MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

IndexError: list index out of range | get_words_speaker_mapping #110

Closed LSRAO closed 9 months ago

LSRAO commented 9 months ago

I was facing an issue where the VAD progress got stuck in diarize_parallel.py (Discussion link), which did not happen in the notebook. So I turned that notebook into a Python project, modifying it to read files from a folder.

Issue: IndexError: list index out of range

Stack trace:

2023-10-18 16:35:55.023445: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-18 16:35:55.023494: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-18 16:35:55.023550: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-18 16:35:55.029044: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-18 16:35:55.685416: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ksuser/noise/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/home/ksuser/noise/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /home/ksuser/Documents/conversationTranscribe/temp_outputs/htdemucs
Separating track data/input_audio/1696526743613_1000060738175_1022_2224792.mp3
  0%|                                                                                  | 0.0/64.35 [00:00<?, ?seconds/s]Segmentation fault (core dumped)
WARNING:root:Source splitting failed, using original audio file.
[NeMo I 2023-10-18 16:36:34 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-10-18 16:36:34 cloud:58] Found existing object /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-10-18 16:36:34 cloud:64] Re-using file from: /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2023-10-18 16:36:34 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-10-18 16:36:35 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-10-18 16:36:35 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-10-18 16:36:35 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-10-18 16:36:35 features:289] PADDING: 16
[NeMo I 2023-10-18 16:36:35 features:289] PADDING: 16
[NeMo I 2023-10-18 16:36:35 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-10-18 16:36:35 features:289] PADDING: 16
[NeMo I 2023-10-18 16:36:35 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-10-18 16:36:35 cloud:58] Found existing object /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-10-18 16:36:35 cloud:64] Re-using file from: /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2023-10-18 16:36:35 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-10-18 16:36:35 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-10-18 16:36:35 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-10-18 16:36:35 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-10-18 16:36:35 features:289] PADDING: 16
[NeMo I 2023-10-18 16:36:35 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/ksuser/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-10-18 16:36:36 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-10-18 16:36:36 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo W 2023-10-18 16:36:36 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-10-18 16:36:36 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-10-18 16:36:36 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.27it/s]
[NeMo I 2023-10-18 16:36:36 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2023-10-18 16:36:36 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2023-10-18 16:36:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:36 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2023-10-18 16:36:36 collections:304] # 2 files loaded accounting to # 1 labels
vad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.68it/s]
[NeMo I 2023-10-18 16:36:38 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2023-10-18 16:36:38 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.                                                                                             
creating speech segments: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.08it/s]
[NeMo I 2023-10-18 16:36:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2023-10-18 16:36:38 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-10-18 16:36:38 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:38 collections:302] Dataset loaded with 57 items, total duration of  0.02 hours.
[NeMo I 2023-10-18 16:36:38 collections:304] # 57 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.00it/s]
[NeMo I 2023-10-18 16:36:39 clustering_diarizer:389] Saved embedding files to /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-10-18 16:36:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2023-10-18 16:36:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-10-18 16:36:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:39 collections:302] Dataset loaded with 67 items, total duration of  0.02 hours.
[NeMo I 2023-10-18 16:36:39 collections:304] # 67 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.08it/s]
[NeMo I 2023-10-18 16:36:40 clustering_diarizer:389] Saved embedding files to /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-10-18 16:36:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2023-10-18 16:36:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-10-18 16:36:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:40 collections:302] Dataset loaded with 91 items, total duration of  0.02 hours.
[NeMo I 2023-10-18 16:36:40 collections:304] # 91 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.78it/s]
[NeMo I 2023-10-18 16:36:41 clustering_diarizer:389] Saved embedding files to /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-10-18 16:36:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2023-10-18 16:36:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-10-18 16:36:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:41 collections:302] Dataset loaded with 120 items, total duration of  0.02 hours.
[NeMo I 2023-10-18 16:36:41 collections:304] # 120 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.86it/s]
[NeMo I 2023-10-18 16:36:42 clustering_diarizer:389] Saved embedding files to /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-10-18 16:36:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2023-10-18 16:36:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-10-18 16:36:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-10-18 16:36:42 collections:302] Dataset loaded with 189 items, total duration of  0.03 hours.
[NeMo I 2023-10-18 16:36:42 collections:304] # 189 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.25it/s]
[NeMo I 2023-10-18 16:36:43 clustering_diarizer:389] Saved embedding files to /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings
clustering: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.71it/s]
[NeMo I 2023-10-18 16:36:44 clustering_diarizer:464] Outputs are saved in /home/ksuser/Documents/conversationTranscribe/temp_outputs directory
[NeMo W 2023-10-18 16:36:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-10-18 16:36:44 msdd_models:960] Loading embedding pickle file of scale:0 at /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2023-10-18 16:36:44 msdd_models:960] Loading embedding pickle file of scale:1 at /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2023-10-18 16:36:44 msdd_models:960] Loading embedding pickle file of scale:2 at /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2023-10-18 16:36:44 msdd_models:960] Loading embedding pickle file of scale:3 at /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2023-10-18 16:36:44 msdd_models:960] Loading embedding pickle file of scale:4 at /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2023-10-18 16:36:44 msdd_models:938] Loading cluster label file from /home/ksuser/Documents/conversationTranscribe/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2023-10-18 16:36:44 collections:617] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-10-18 16:36:44 collections:620] Total 3 session files loaded accounting to # 3 audio clips
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.99it/s]
[NeMo I 2023-10-18 16:36:44 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2023-10-18 16:36:44 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-10-18 16:36:44 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-10-18 16:36:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-10-18 16:36:44 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-10-18 16:36:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-10-18 16:36:44 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-10-18 16:36:44 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-10-18 16:36:44 msdd_models:1431]   

Traceback (most recent call last):
  File "/home/ksuser/Documents/conversationTranscribe/main.py", line 6, in <module>
    diarize.process_audio_folder(input_folder, output_folder)
  File "/home/ksuser/Documents/conversationTranscribe/src/diarize.py", line 127, in process_audio_folder
    wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")
  File "/home/ksuser/Documents/conversationTranscribe/src/__init__.py", line 116, in get_words_speaker_mapping
    s, e, sp = spk_ts[turn_idx]
IndexError: list index out of range

LSRAO commented 9 months ago

So it was a silly mistake. I indexed spk_ts before clamping turn_idx to the last valid index. I wrote this:

s, e, sp = spk_ts[turn_idx]
turn_idx = min(turn_idx, len(spk_ts) - 1)

Correct order of code:

turn_idx = min(turn_idx, len(spk_ts) - 1)
s, e, sp = spk_ts[turn_idx]
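
For anyone hitting the same error, here is a minimal sketch of why the order matters. It is a simplified, hypothetical stand-in for the mapping loop, not the repository's actual get_words_speaker_mapping: the function name map_words_to_speakers, the tuple layout of the inputs, and the sample data are all assumptions made for illustration. The key point is that turn_idx has to be clamped to len(spk_ts) - 1 before it is used as an index; otherwise a word that starts after the final speaker turn pushes the index past the end of the list.

```python
def map_words_to_speakers(word_ts, spk_ts):
    """Assign each word to a speaker turn (simplified, hypothetical sketch).

    word_ts: list of (start_ms, end_ms, text) tuples, sorted by start time.
    spk_ts:  non-empty list of [start_ms, end_ms, speaker] turns, sorted by time.
    """
    turn_idx = 0
    s, e, sp = spk_ts[turn_idx]
    mapping = []
    for ws, we, text in word_ts:
        # Advance to the next turn while the word starts after the current turn ends.
        while ws > e:
            turn_idx += 1
            # Clamp BEFORE indexing, so turn_idx never runs past the last turn.
            turn_idx = min(turn_idx, len(spk_ts) - 1)
            s, e, sp = spk_ts[turn_idx]
            if turn_idx == len(spk_ts) - 1:
                # On the last turn, stretch its end to cover this word so the
                # while loop is guaranteed to terminate.
                e = max(e, ws, we)
        mapping.append({"word": text, "start": ws, "end": we, "speaker": sp})
    return mapping


# Made-up sample data: the last word starts well after the final speaker turn ends.
words = [(0, 100, "hi"), (150, 250, "there"), (900, 1000, "late")]
turns = [[0, 120, 0], [120, 300, 1]]
result = map_words_to_speakers(words, turns)
```

With the two lines swapped (index first, clamp second), the third word in this sample would raise IndexError: list index out of range, because turn_idx reaches 2 while spk_ts only has indices 0 and 1. With the clamp first, trailing words are simply attributed to the last speaker turn.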