NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Diarization | IndexError: shape mismatch: indexing tensors could not be broadcast together #8278

Closed Oscaarjs closed 7 months ago

Oscaarjs commented 9 months ago

Describe the bug

When running diarization on a specific file I'm getting: IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [7557001], [7559750]

The same pipeline, config, and setup have worked for a couple of thousand other files, but on this file (and a few others) I'm suddenly getting this error.

Steps/Code to reproduce bug

I've shared a Google Colab notebook that reproduces the error; the steps are also listed below.

Shared Google Colab notebook ipynb

Steps:

Pre-steps: Upload issue_file.wav, config.yaml and speech_timestamps.rttm to the runtime (e.g. Colab). The files mentioned above can be obtained:

issue_file.wav: audio file
config.yaml: diarization config
speech_timestamps.rttm: speech timestamps RTTM

Let me know if anything's missing or unclear.

1.

!apt-get update && apt-get install -y libsndfile1 ffmpeg
!pip install nemo_toolkit['asr']

2.

import os
import torch
import yaml
import json
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

3.

def diarize(workdir: str, rttm_filepath):

    manifest_path  = os.path.join(workdir, "manifest.json")
    output_dir     = os.path.join(workdir, "output")

    manifest =  {
        'audio_filepath': '/content/issue_file.wav',
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        'num_speakers': None,
        'rttm_filepath': rttm_filepath,
        'uem_filepath': None,
    }

    with open('/content/config.yaml', "r") as config_file:
        config_dict = yaml.load(config_file, Loader=yaml.FullLoader)

    config = OmegaConf.create(config_dict['diarizer'])
    config.device = "cuda:0" if torch.cuda.is_available() else "cpu"
    config.diarizer.manifest_filepath = manifest_path
    config.diarizer.oracle_vad = True
    config.diarizer.speaker_embeddings.model_path = 'titanet_large'
    config.diarizer.out_dir = output_dir

    with open(manifest_path, "w") as manifest_file:
        json.dump(manifest, manifest_file)

    model = ClusteringDiarizer(cfg=config)
    model.diarize()

4.

diarize('/content/', '/content/speech_timestamps.rttm')

This fails:

<ipython-input-7-4afd3b7bf15c> in diarize(workdir, rttm_filepath)
     30 
     31     model = ClusteringDiarizer(cfg=config)
---> 32     model.diarize()

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/clustering_diarizer.py in diarize(self, paths2audio_files, batch_size)
    454 
    455         # Clustering
--> 456         all_reference, all_hypothesis = perform_clustering(
    457             embs_and_timestamps=embs_and_timestamps,
    458             AUDIO_RTTM_MAP=self.AUDIO_RTTM_MAP,

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/speaker_utils.py in perform_clustering(embs_and_timestamps, AUDIO_RTTM_MAP, out_rttm_dir, clustering_params, device, verbose)
    483         base_scale_idx = uniq_embs_and_timestamps['multiscale_segment_counts'].shape[0] - 1
    484 
--> 485         cluster_labels = speaker_clustering.forward_infer(
    486             embeddings_in_scales=uniq_embs_and_timestamps['embeddings'],
    487             timestamps_in_scales=uniq_embs_and_timestamps['timestamps'],

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/longform_clustering.py in forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, oracle_num_speakers, max_rp_threshold, max_num_speakers, enhanced_count_thres, sparse_search_volume, fixed_thres, chunk_cluster_count, embeddings_per_chunk)
    407             )
    408         else:
--> 409             cluster_labels = self.speaker_clustering.forward_infer(
    410                 embeddings_in_scales=embeddings_in_scales,
    411                 timestamps_in_scales=timestamps_in_scales,

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in forward_infer(self, embeddings_in_scales, timestamps_in_scales, multiscale_segment_counts, multiscale_weights, oracle_num_speakers, max_num_speakers, max_rp_threshold, enhanced_count_thres, sparse_search_volume, fixed_thres, kmeans_random_trials)
   1376         )
   1377 
-> 1378         return self.forward_unit_infer(
   1379             mat=mat,
   1380             oracle_num_speakers=oracle_num_speakers,

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in forward_unit_infer(self, mat, oracle_num_speakers, max_num_speakers, max_rp_threshold, sparse_search_volume, est_num_of_spk_enhanced, fixed_thres, kmeans_random_trials)
   1225         if mat.shape[0] > self.min_samples_for_nmesc:
   1226             est_num_of_spk, p_hat_value = nmesc.forward()
-> 1227             affinity_mat = getAffinityGraphMat(mat, p_hat_value)
   1228         else:
   1229             nmesc.fixed_thres = max_rp_threshold

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in getAffinityGraphMat(affinity_mat_raw, p_value)
    349     symmetrize the binarized graph matrix.
    350     """
--> 351     X = affinity_mat_raw if p_value <= 0 else getKneighborsConnections(affinity_mat_raw, p_value)
    352     symm_affinity_mat = 0.5 * (X + X.T)
    353     return symm_affinity_mat

/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/utils/offline_clustering.py in getKneighborsConnections(affinity_mat, p_value, mask_method)
    332     indices_col = torch.arange(dim[1]).repeat(p_value, 1).T.flatten()
    333     if mask_method == 'binary' or mask_method is None:
--> 334         binarized_affinity_mat[indices_row, indices_col] = (
    335             torch.ones(indices_row.shape[0]).to(affinity_mat.device).half()
    336         )

IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [7557001], [7559750]
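
A quick arithmetic check on the two shapes in the error message may help narrow this down. The sketch below only factors the reported lengths. The reading that the affinity matrix is n x n and that p_value equals n + 1 is an assumption drawn from the numbers and is not confirmed anywhere in this thread.

```python
import math

# Lengths copied from the IndexError message in the traceback above.
len_indices_row = 7557001
len_indices_col = 7559750

# Assumption (not confirmed in the thread): the affinity matrix is n x n and
# p_value == n + 1. Under that reading, indices_row would hold n * n entries,
# while indices_col is built as arange(n).repeat(p_value, 1) and so holds
# n * p_value entries.
n = math.isqrt(len_indices_row)
p_value = len_indices_col // n

print(n, p_value)                      # 2749 2750
print(n * n == len_indices_row)        # True
print(n * p_value == len_indices_col)  # True
```

Both lengths factor cleanly this way (2749 * 2749 and 2749 * 2750), which is at least consistent with a neighbor count that exceeds the number of segments by one.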

Expected behavior

Diarization shouldn't fail on a seemingly non-corrupt audio file. This config has been tested multiple times before on other files without any issues.

Environment overview (please complete the following information)

Environment details


PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
=============
2.1.0+cu121
=============
Python 3.10.12

tango4j commented 9 months ago

Hi. Let us test on the wav file you provided. This is a new type of error that we have never encountered. It appears that the p_value value in this line is causing the error.

I will follow your settings and check what is causing this error.
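
To illustrate how p_value could produce this kind of mismatch, here is a minimal sketch on a tiny matrix. It is not the NeMo implementation: `knn_index_lengths` and its `clamp` flag are hypothetical, and the way `indices_row` is built (a sliced `argsort`) is an assumption. Only the `indices_col` expression and the indexed assignment mirror the lines visible in the traceback.

```python
import torch


def knn_index_lengths(n: int, p_value: int, clamp: bool = False):
    """Hypothetical helper: build row/column indices the way the traceback suggests."""
    affinity = torch.rand(n, n)
    if clamp:
        # Possible guard (an assumption, not a confirmed fix): never ask for
        # more neighbors than there are columns.
        p_value = min(p_value, n)
    # Assumed construction: slicing past the end silently truncates to n columns,
    # so this holds n * min(p_value, n) entries.
    indices_row = torch.argsort(affinity, dim=1, descending=True)[:, :p_value].flatten()
    # Same expression as in the traceback: always n * p_value entries.
    indices_col = torch.arange(n).repeat(p_value, 1).T.flatten()
    binarized = torch.zeros_like(affinity)
    binarized[indices_row, indices_col] = torch.ones(indices_row.shape[0])
    return indices_row.shape[0], indices_col.shape[0]


# p_value larger than the matrix size: the two index tensors differ in length.
try:
    knn_index_lengths(n=4, p_value=5)
    error_message = None
except IndexError as err:
    error_message = str(err)
print(error_message)

# With the clamp, both index tensors have the same length and the assignment works.
print(knn_index_lengths(n=4, p_value=5, clamp=True))
```

If this reading is right, the root cause would lie in how p_value is estimated upstream, so a clamp like the one above would only mask the symptom.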

Oscaarjs commented 8 months ago

@tango4j any updates on this issue?

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.