Closed SagyHarpazGong closed 7 months ago
Hi,
If possible, we always appreciate a reproducible script with audio so that we can see the error on our end. I hope you set max_num_speakers to 8. However, if a speaker's speech duration is very low, the NMESC algorithm tends to suppress that speaker. You can tweak max_rp_threshold to improve performance on speakers with little speech.
Config:
https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml#L48-L56
@tango4j to add more info on parameters to tune for such cases.
@nithinraok Thanks for the response. Unfortunately, I can't share the audio, but I can share the clustering parameters:
clustering:
  parameters:
    oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
    max_num_speakers: 12 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
    enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
    max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
    sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
    maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.
    chunk_cluster_count: 25 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
    embeddings_per_chunk: 20000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)
I tried max_rp_threshold values from 0.1 to 0.95 and still get only 4 clusters.
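For anyone who wants to repeat this kind of sweep, here is a minimal sketch of how the override strings could be generated. It assumes the parameter lives under `diarizer.clustering.parameters` (as the linked config layout suggests) and that the inference script accepts Hydra-style `key=value` overrides. The helper name `sweep_overrides` is hypothetical and not part of NeMo.

```python
# Sketch only: builds Hydra-style override strings for a max_rp_threshold sweep.
# The key path below is an assumption based on the linked diar_infer_meeting.yaml.
DEFAULT_KEY = "diarizer.clustering.parameters.max_rp_threshold"


def sweep_overrides(start, stop, steps, key=DEFAULT_KEY):
    """Return 'key=value' strings for `steps` evenly spaced values in [start, stop]."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (stop - start) / (steps - 1)
    return [f"{key}={start + i * step:.2f}" for i in range(steps)]


# Same range as tried above: 0.1 to 0.95, here in six steps.
overrides = sweep_overrides(0.10, 0.95, 6)
for override in overrides:
    print(override)
```

Each printed string can be appended to whatever command launches the diarization run, one run per value, so the estimated speaker count can be compared across thresholds.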
@SagyHarpazGong
Unfortunately, it is quite hard to count the number of speakers correctly when some speakers have relatively little speech. Speaker diarization is a pattern recognition system, so it is not guaranteed to work without errors. Besides the parameters @nithinraok mentioned, you can try changing the following parameters if your audio clip is long enough (over 1 hour).
chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)
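To get a feel for how these two values interact, here is a rough back-of-the-envelope sketch. It takes the config comment at face value (10000 embeddings is approximately 40 minutes of audio) and assumes that rate scales linearly with duration; the real embedding count depends on the window and shift settings, so treat the numbers as estimates only. The function name `estimate_chunks` is hypothetical.

```python
import math

# From the config comment: 10000 embeddings is approximately 40 minutes of audio.
EMBEDDINGS_PER_40_MIN = 10_000


def estimate_chunks(audio_minutes, embeddings_per_chunk):
    """Estimate (total embeddings, number of chunks), assuming a linear rate."""
    embeddings_per_minute = EMBEDDINGS_PER_40_MIN / 40
    total_embeddings = audio_minutes * embeddings_per_minute
    return total_embeddings, math.ceil(total_embeddings / embeddings_per_chunk)


# A 90-minute recording with the two embeddings_per_chunk values seen in this thread.
for per_chunk in (20_000, 10_000):
    total, chunks = estimate_chunks(90, per_chunk)
    print(f"embeddings_per_chunk={per_chunk}: ~{total:.0f} embeddings, {chunks} chunk(s)")
```

Under these assumptions a 90-minute recording yields roughly 22500 embeddings, which is 2 chunks at embeddings_per_chunk: 20000 and 3 chunks at 10000, so the lower value splits the audio into more units that are each overclustered to chunk_cluster_count clusters.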
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hi all.
I have an audio recording of about 90 minutes that contains 8 speakers. When I run MSDD, I get only 4 speakers. Some of the speakers have a pretty low speech ratio relative to the other speakers.
What can I do to get better results? Using the oracle number of speakers is not a solution, because the speaker confusion rate is very high in that case and I won't always know the number of speakers.
Thanks
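To quantify how low a speaker's speech ratio is, one option is to sum the segment durations per speaker in a reference RTTM file. The sketch below assumes standard RTTM lines (`SPEAKER <file> <chan> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>`). The sample lines are made up for illustration and are not taken from the recording discussed here.

```python
from collections import defaultdict


def speech_ratios(rttm_lines):
    """Return each speaker's share of the total labeled speech time."""
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        # Skip blank, malformed, or non-SPEAKER lines.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])  # field 5 is the turn duration in seconds
        speaker = fields[7]          # field 8 is the speaker name
        totals[speaker] += duration
    total = sum(totals.values())
    if total == 0:
        return {}
    return {speaker: dur / total for speaker, dur in totals.items()}


# Hypothetical RTTM content, for illustration only.
example = [
    "SPEAKER meeting 1 0.00 60.00 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER meeting 1 60.00 30.00 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER meeting 1 90.00 10.00 <NA> <NA> spk2 <NA> <NA>",
]
ratios = speech_ratios(example)
for speaker, ratio in sorted(ratios.items()):
    print(f"{speaker}: {ratio:.1%}")
```

Note that overlapping speech is counted once per speaker, so the ratios are shares of total speaker time rather than of wall-clock time.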