NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

NeMo MSDD miss detection #8374

Closed SagyHarpazGong closed 7 months ago

SagyHarpazGong commented 8 months ago

Hi all.

I have an audio recording about 90 minutes long that contains 8 speakers, but when I run MSDD I get only 4 speakers. Some of the speakers have a much lower speech ratio than the others.

What can I do to get better results? Using the oracle number of speakers is not a solution, because the speaker confusion rate is very high in that case and I won't always know the number of speakers.

Thanks

nithinraok commented 8 months ago

Hi. If possible, we always appreciate a reproducible script with audio so that we can see the error on our end. I hope you set max_num_speakers to 8. However, if a speaker's speech duration is very low, the NMESC algorithm tends to suppress that speaker. You can tweak max_rp_threshold to improve the performance on speakers with less speech duration. Config: https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml#L48-L56 @tango4j to add more info on parameters to tune for such cases.
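
For anyone trying the suggestion above from a script, here is a minimal Python sketch that turns the two parameters into dotted `key=value` override strings. The `diarizer.clustering.parameters` prefix is an assumption about how the linked YAML nests these keys, the helper name `to_overrides` is hypothetical, and the value 0.3 is only illustrative, not a recommended setting.

```python
# Sketch: build dotted "key=value" overrides for the clustering parameters.
# ASSUMPTION: the "diarizer.clustering.parameters" prefix matches the nesting
# in your own config file; verify it before relying on these strings.

def to_overrides(params, prefix="diarizer.clustering.parameters"):
    """Return one 'dotted.key=value' string per parameter."""
    return [f"{prefix}.{key}={value}" for key, value in params.items()]


# Illustrative values only: 8 speakers as in the question, 0.3 picked arbitrarily.
overrides = to_overrides({"max_num_speakers": 8, "max_rp_threshold": 0.3})
for item in overrides:
    print(item)
```

Strings in this form can be pasted into a config-override mechanism that accepts dotted keys, or simply used as a record of which values were tried.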

SagyHarpazGong commented 8 months ago

@nithinraok Thanks for the response. Unfortunately I can't share the audio, but I can share the clustering parameters:

 clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 12 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
      sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
      maj_vote_spk_count: False  # If True, take a majority vote on multiple p-values to estimate the number of speakers.
      chunk_cluster_count: 25 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
      embeddings_per_chunk: 20000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)

I tried values of max_rp_threshold ranging from 0.1 to 0.95 and still get only 4 clusters.
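
A sweep like the one described can be recorded with a small helper, sketched below. `count_speakers` is a hypothetical callable that would run diarization with a given max_rp_threshold and return the number of distinct speakers; the stand-in used here just returns 4 to mirror the reported behaviour and does not call NeMo.

```python
# Sketch: record the estimated speaker count for each max_rp_threshold tried.

def sweep_max_rp_threshold(thresholds, count_speakers):
    """Map each threshold to the speaker count reported for it.

    count_speakers is a HYPOTHETICAL callable supplied by the user: it should
    run diarization with the given threshold and return the speaker count.
    """
    return {t: count_speakers(t) for t in thresholds}


def fake_count_speakers(threshold):
    # Stand-in for a real diarization run; mimics the result reported above.
    return 4


thresholds = [0.1, 0.25, 0.5, 0.75, 0.95]
results = sweep_max_rp_threshold(thresholds, fake_count_speakers)

# If every threshold yields the same count, this parameter alone is not
# what limits the estimate for this recording.
stuck = len(set(results.values())) == 1
print(results, "count never changes" if stuck else "count varies")
```

If the count stays flat across the whole range, as it does here, that suggests looking at other parameters rather than widening the sweep.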

tango4j commented 8 months ago

@SagyHarpazGong

Unfortunately, it is quite hard to count the number of speakers correctly when some speakers have relatively little speech. Speaker diarization is a type of pattern recognition system, so it is not guaranteed to work without errors. In addition to the parameters @nithinraok mentioned, you can try changing the following parameters if your audio clip is long enough (over 1 hour).

      chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
      embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio) 
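
To get a feel for what these two settings mean for a 90-minute recording, here is a rough back-of-the-envelope sketch in Python. It assumes the rate from the config comment (10000 embeddings for roughly 40 minutes of audio) scales linearly and that a partial final chunk counts as a full chunk; how NeMo actually splits the audio is not confirmed here, and the function names are hypothetical.

```python
import math

# ASSUMPTION from the config comment: 10000 embeddings ~ 40 minutes of audio.
EMBEDDINGS_PER_MINUTE = 10000 / 40  # about 250 embeddings per minute


def estimate_chunks(audio_minutes, embeddings_per_chunk):
    """Rough number of chunks, counting a partial last chunk as one chunk."""
    total_embeddings = audio_minutes * EMBEDDINGS_PER_MINUTE
    return math.ceil(total_embeddings / embeddings_per_chunk)


def estimate_forced_clusters(audio_minutes, embeddings_per_chunk, chunk_cluster_count):
    """Rough total of over-clustered groups across all chunks."""
    return estimate_chunks(audio_minutes, embeddings_per_chunk) * chunk_cluster_count


# Settings from the original post versus the suggested ones, for 90 minutes.
original = estimate_forced_clusters(90, embeddings_per_chunk=20000, chunk_cluster_count=25)
suggested = estimate_forced_clusters(90, embeddings_per_chunk=10000, chunk_cluster_count=50)
print(original, suggested)
```

Under these assumptions the suggested values split the recording into more chunks and keep more forced clusters per chunk, which gives low-activity speakers more room to survive the first clustering pass.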