MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Some errors occurred during Speaker Diarization using the NeMo MSDD Model #209

Open ievenight opened 3 weeks ago

ievenight commented 3 weeks ago

Issue Description

When I run diarize.py or the Jupyter Notebook version, the MSDD diarization step fails with a tensor size mismatch (see the RuntimeError at the end of the log).

Log

100% [................................................................................] 7646 / 7646
[NeMo I 2024-08-17 17:31:13 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-08-17 17:31:13 cloud:58] Found existing object C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2024-08-17 17:31:13 cloud:64] Re-using file from: C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2024-08-17 17:31:13 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-08-17 17:31:14 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2024-08-17 17:31:14 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2024-08-17 17:31:14 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2024-08-17 17:31:14 features:289] PADDING: 16
[NeMo I 2024-08-17 17:31:14 features:289] PADDING: 16
[NeMo I 2024-08-17 17:31:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2024-08-17 17:31:15 features:289] PADDING: 16
[NeMo I 2024-08-17 17:31:16 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-17 17:31:16 cloud:58] Found existing object C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-17 17:31:16 cloud:64] Re-using file from: C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2024-08-17 17:31:16 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-08-17 17:31:16 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2024-08-17 17:31:16 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2024-08-17 17:31:16 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2024-08-17 17:31:16 features:289] PADDING: 16
[NeMo I 2024-08-17 17:31:16 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from C:\Users\ckic\.cache\torch\NeMo\NeMo_1.20.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-17 17:31:16 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-17 17:31:16 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo W 2024-08-17 17:31:16 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2024-08-17 17:31:16 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-17 17:31:16 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.50s/it]
[NeMo I 2024-08-17 17:31:18 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-17 17:31:18 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2024-08-17 17:31:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:31:18 collections:302] Dataset loaded with 212 items, total duration of  2.95 hours.
[NeMo I 2024-08-17 17:31:18 collections:304] # 212 files loaded accounting to # 1 labels

vad: 100%|███████████████████████████████████████████████████████████████████████████| 212/212 [01:02<00:00,  3.41it/s]
[NeMo I 2024-08-17 17:32:21 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-17 17:34:04 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|██████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.84s/it]
[NeMo I 2024-08-17 17:34:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale0.json
[NeMo I 2024-08-17 17:34:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 17:34:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:34:14 collections:302] Dataset loaded with 11136 items, total duration of  4.28 hours.
[NeMo I 2024-08-17 17:34:14 collections:304] # 11136 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 174/174 [00:19<00:00,  8.79it/s]
[NeMo I 2024-08-17 17:34:38 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings
[NeMo I 2024-08-17 17:34:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale1.json
[NeMo I 2024-08-17 17:34:38 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 17:34:38 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:34:38 collections:302] Dataset loaded with 13484 items, total duration of  4.39 hours.
[NeMo I 2024-08-17 17:34:38 collections:304] # 13484 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 211/211 [00:20<00:00, 10.54it/s]
[NeMo I 2024-08-17 17:35:04 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings
[NeMo I 2024-08-17 17:35:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale2.json
[NeMo I 2024-08-17 17:35:05 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 17:35:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:35:05 collections:302] Dataset loaded with 17027 items, total duration of  4.51 hours.
[NeMo I 2024-08-17 17:35:05 collections:304] # 17027 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 267/267 [00:24<00:00, 10.77it/s]
[NeMo I 2024-08-17 17:35:39 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings
[NeMo I 2024-08-17 17:35:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale3.json
[NeMo I 2024-08-17 17:35:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 17:35:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:35:40 collections:302] Dataset loaded with 23024 items, total duration of  4.64 hours.
[NeMo I 2024-08-17 17:35:40 collections:304] # 23024 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 360/360 [00:31<00:00, 11.53it/s]
[NeMo I 2024-08-17 17:36:29 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings
[NeMo I 2024-08-17 17:36:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\subsegments_scale4.json
[NeMo I 2024-08-17 17:36:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 17:36:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 17:36:30 collections:302] Dataset loaded with 35138 items, total duration of  4.78 hours.
[NeMo I 2024-08-17 17:36:30 collections:304] # 35138 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|██████████████████████████████████████████████████████| 550/550 [00:40<00:00, 13.63it/s]
[NeMo I 2024-08-17 17:37:49 clustering_diarizer:389] Saved embedding files to C:\Users\ckic\Desktop\whisper-diarization-main\temp_outputs\speaker_outputs\embeddings
clustering:   0%|                                                                                | 0/1 [01:26<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 3
      1 # Initialize NeMo MSDD diarization model
      2 msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda")
----> 3 msdd_model.diarize()
      5 del msdd_model
      6 torch.cuda.empty_cache()

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\torch\utils\_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:1180, in NeuralDiarizer.diarize(self)
   1173 @torch.no_grad()
   1174 def diarize(self) -> Optional[List[Optional[List[Tuple[DiarizationErrorRate, Dict]]]]]:
   1175     """
   1176     Launch diarization pipeline which starts from VAD (or a oracle VAD stamp generation), initialization clustering and multiscale diarization decoder (MSDD).
   1177     Note that the result of MSDD can include multiple speakers at the same time. Therefore, RTTM output of MSDD needs to be based on `make_rttm_with_overlap()`
   1178     function that can generate overlapping timestamps. `self.run_overlap_aware_eval()` function performs DER evaluation.
   1179     """
-> 1180     self.clustering_embedding.prepare_cluster_embs_infer()
   1181     self.msdd_model.pairwise_infer = True
   1182     self.get_emb_clus_infer(self.clustering_embedding)

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:699, in ClusterEmbedding.prepare_cluster_embs_infer(self)
    695 """
    696 Launch clustering diarizer to prepare embedding vectors and clustering results.
    697 """
    698 self.max_num_speakers = self.cfg_diar_infer.diarizer.clustering.parameters.max_num_speakers
--> 699 self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
    700     self._cfg_msdd.test_ds.manifest_filepath, self._cfg_msdd.test_ds.emb_dir
    701 )

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\msdd_models.py:866, in ClusterEmbedding.run_clustering_diarizer(self, manifest_filepath, emb_dir)
    864 logging.info(f"Multiscale Weights: {self.clus_diar_model.multiscale_args_dict['multiscale_weights']}")
    865 logging.info(f"Clustering Parameters: {clustering_params_str}")
--> 866 scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
    868 # If RTTM (ground-truth diarization annotation) files do not exist, scores is None.
    869 if scores is not None:

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py:456, in ClusteringDiarizer.diarize(self, paths2audio_files, batch_size)
    451 embs_and_timestamps = get_embs_and_timestamps(
    452     self.multiscale_embeddings_and_timestamps, self.multiscale_args_dict
    453 )
    455 # Clustering
--> 456 all_reference, all_hypothesis = perform_clustering(
    457     embs_and_timestamps=embs_and_timestamps,
    458     AUDIO_RTTM_MAP=self.AUDIO_RTTM_MAP,
    459     out_rttm_dir=out_rttm_dir,
    460     clustering_params=self._cluster_params,
    461     device=self._speaker_model.device,
    462     verbose=self.verbose,
    463 )
    464 logging.info("Outputs are saved in {} directory".format(os.path.abspath(self._diarizer_params.out_dir)))
    466 # Scoring

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\speaker_utils.py:486, in perform_clustering(embs_and_timestamps, AUDIO_RTTM_MAP, out_rttm_dir, clustering_params, device, verbose)
    482     num_speakers = -1
    484 base_scale_idx = uniq_embs_and_timestamps['multiscale_segment_counts'].shape[0] - 1
--> 486 cluster_labels = speaker_clustering.forward_infer(
    487     embeddings_in_scales=uniq_embs_and_timestamps['embeddings'],
    488     timestamps_in_scales=uniq_embs_and_timestamps['timestamps'],
    489     multiscale_segment_counts=uniq_embs_and_timestamps['multiscale_segment_counts'],
    490     multiscale_weights=uniq_embs_and_timestamps['multiscale_weights'],
    491     oracle_num_speakers=int(num_speakers),
    492     max_num_speakers=int(clustering_params.max_num_speakers),
    493     max_rp_threshold=float(clustering_params.max_rp_threshold),
    494     sparse_search_volume=int(clustering_params.sparse_search_volume),
    495 )
    497 del uniq_embs_and_timestamps
    498 if cuda:

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\offline_clustering.py:1288, in SpeakerClustering.forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, oracle_num_speakers, max_rp_threshold, max_num_speakers, enhanced_count_thres, sparse_search_volume, fixed_thres, kmeans_random_trials)
   1285 if oracle_num_speakers > 0:
   1286     max_num_speakers = oracle_num_speakers
-> 1288 mat = getMultiScaleCosAffinityMatrix(
   1289     multiscale_weights, self.embeddings_in_scales, self.timestamps_in_scales, self.device
   1290 )
   1292 nmesc = NMESC(
   1293     mat,
   1294     max_num_speakers=max_num_speakers,
   (...)
   1303     device=self.device,
   1304 )
   1306 # If there are less than `min_samples_for_nmesc` segments, est_num_of_spk is 1.

File ~\anaconda3\envs\whisper-diarization\lib\site-packages\nemo\collections\asr\parts\utils\offline_clustering.py:529, in getMultiScaleCosAffinityMatrix(multiscale_weights, embeddings_in_scales, timestamps_in_scales, device)
    527     repeated_tensor_0 = torch.repeat_interleave(score_mat_torch, repeats=repeat_list, dim=0).to(device)
    528     repeated_tensor_1 = torch.repeat_interleave(repeated_tensor_0, repeats=repeat_list, dim=1).to(device)
--> 529     fused_sim_d += multiscale_weights[scale_idx] * repeated_tensor_1
    530 return fused_sim_d

RuntimeError: The size of tensor a (35138) must match the size of tensor b (31801) at non-singleton dimension 1
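
For readers following the traceback: the failing line adds a coarser-scale similarity matrix, expanded with `torch.repeat_interleave`, into the accumulator `fused_sim_d`. The 35138 in the error matches the scale4 segment count reported earlier in the log, so the expanded matrix (31801) appears to cover fewer segments than the base scale. The sketch below is a toy, pure-Python illustration of that kind of mismatch, not NeMo's code; `repeat_rows_cols` and `fuse_scales` are hypothetical names, and the explicit length check stands in for the shape error PyTorch raises.

```python
def repeat_rows_cols(mat, repeats):
    """Repeat row i and column i of a square matrix repeats[i] times.

    This mimics applying repeat_interleave along dim 0 and then dim 1.
    """
    rows = [row for row, r in zip(mat, repeats) for _ in range(r)]
    return [[v for v, r in zip(row, repeats) for _ in range(r)] for row in rows]


def fuse_scales(base_len, scale_mats, repeat_lists, weights):
    """Weighted sum of per-scale matrices after expanding each to base_len."""
    fused = [[0.0] * base_len for _ in range(base_len)]
    for w, mat, reps in zip(weights, scale_mats, repeat_lists):
        expanded = repeat_rows_cols(mat, reps)
        # The repeat counts must sum to the number of base-scale segments;
        # otherwise the element-wise addition below is not defined.
        if len(expanded) != base_len:
            raise ValueError(
                f"expanded size {len(expanded)} does not match base size {base_len}"
            )
        for i in range(base_len):
            for j in range(base_len):
                fused[i][j] += w * expanded[i][j]
    return fused


# Two coarse segments mapped onto five base-scale segments (3 + 2 = 5): works.
coarse = [[1.0, 0.2], [0.2, 1.0]]
fused = fuse_scales(5, [coarse], [[3, 2]], [1.0])

# Repeat counts that sum to 4 instead of 5 reproduce the mismatch.
try:
    fuse_scales(5, [coarse], [[2, 2]], [1.0])
except ValueError as err:
    print(err)
```

If this reading is right, the useful thing to compare is the per-scale segment counts against the repeat counts derived from the scale mapping for this particular audio file.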

MahmoudAshraf97 commented 3 weeks ago

Please upload an audio file that reproduces the problem.

ievenight commented 3 weeks ago

I used this audio file: https://content.blubrry.com/takeituneasy/lex_ai_elon_musk_and_neuralink_team.mp3