Speaker Diarization - Extracting speaker embeddings for labels from files with multiple speakers from Cluster Diarizer models

sunraymoonbeam commented 8 months ago

Hello,

I'm currently engaged in diarization for meetings domains using the ClusteringDiarizer model from the pre-trained pipeline, along with the meeting configuration (_diar_infermeeting.yaml), which has proven quite effective for my use case, just relying on listening by ear. Thanks for creating such a great library! My goal is to acquire cluster-averaged speaker embeddings for each speaker based on the clustering results. However, I'm facing challenges in finding a direct method to retrieve these embeddings from the ClusteringDiarizer module.

I came across a related issue (#6965) for EncDecDiarLabelModel (MSDD) models, where it was recommended to obtain the ms_avg_embs variable from the MSDD forward function, which utilizes the _get_cluster_avg_embsmodel function.

Problem with MSDD model

Thus, I tried to use the MSDD model instead of the ClusteringDiarizer model instead. However, I am facing some challenges as the MSDD model for the meeting domain has not been released (#6748). When attempting to load the MSDD model with the meeting configuration (diar_infer_meeting.yaml) and the "diar_msdd_telephonic" model path, I encounter a size mismatch error after running diarize(). This is likely due to differences in parameters and configurations between domains. I could train a MSDD model on a meeting domain corpus with the correct configurations but that would take some time and effort

Workaround with get_cluster_avg_embs method

As a workaround, I employed the following approach:

msdd_model.clustering_embedding.prepare_cluster_embs_infer()
msdd_model.clustering_embedding.emb_sess_test_dict

I opted for the _prepare_cluster_embsinfer() method, which internally calls the _run_clusteringdiarizer() method, and, in turn, uses the _get_cluster_avgembs method. I then accessed msdd_model.clustering_embedding.emb_sess_test_dict to retrieve cluster average embeddings. Essentially, I am running the clustering diarizer step with the MSDD model without feeding the output into the neural diarizer model.

I have 3 questions:

Question 1: get_cluster_avg_embs vs get_cluster_avg_embs_model method: Correct Approach?

I would like your thoughts on this approach for obtaining cluster average embeddings and whether it is the correct strategy. I'm unclear about the distinctions between _get_cluster_avgembs and _get_cluster_avg_embsmodel. The former is used by prepare_cluster_embs_infer(), and the latter is used in the MSDD forward function. Do they essentially return the same information? The documentation mentions that _get_cluster_avgembs returns _emb_sess_avgdict, a dictionary containing speaker mapping information and cluster-average speaker embedding vectors. The emb_sess_avg_dict keys appear to correspond to the scales (0, 1, 2, etc.), and the associated values are embeddings with a shape of (192, num_of_speakers). I'm also little confused by why the number of identified speakers in the prediction RTTM doesn't match the specified num_of_speakers, perhaps you can provide some insight into this!

Question 2: Obtaining cluster averaged embeddings from the ClusteringDiarizer model instead of MSDD

I'm also curious if there's a more direct way to retrieve embeddings from the clustering diarizer module without relying on this workaround with the MSDD model. The outcomes from the clustering diarizer module, using both diarize() and the MSDD model's .prepare_cluster_embs_infer(), exhibit slight differences (4 vs 5 speakers, the statistics were roughly the same but it seemed that two clusters got merged where one of the cluster was very small in size), which is quite confusing to me as I expected them to utilize the same methods. The MSDD model also seems to run much slower then the ClusteringDiarizer Model.

Question 3: Best representation of embeddings for each speaker

Lastly, I'm seeking guidance on obtaining the most reliable representation of the representative embeddings for each speaker, considering multiple scales. My assumption is that a straightforward approach would be to calculate the average of all the embeddings for each scale. But it is mentioned that decisions are made ultimately on the base scale....if I would like to compare embeddings for speaker recognition or visualise them on a plot, is average or taking the base scale better?

Thank you once again!

Zack

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

maxpain commented 4 months ago

the same question

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

tango4j commented 2 months ago

Hi @sunraymoonbeam @maxpain Sorry for missing this question and not answering this in timely manner. Multi-scale diarization decoder (MSDD) model only supports diar_infer_telephonic.yaml. So you cannot use diar_infer_meeting.yaml settings.

Question 1 get_cluster_avg_embs_model() function is as the name suggests, it is designed for model training. get_cluster_avg_embs() is for inference case, so for your purpose I think using get_cluster_avg_embs() function makes more sense.
Question 2 MSDD inference requires the Clustering Diarizer results, so MSDD inference takes more time than clustering diarizer. Also, you will get different results depending on scale length, scale weights when you run clustering diarizer. Check if you use the same Titanet model, the same scale lengths and scale weights. Then it would generate the same results.
Question 3 If you want to use speaker embeddings for speaker verification, definitely go with the longer scales. However, to get the best results, extract TitaNet speaker embeddings from 5\~10 seconds of audio. I recommend you to avoid using the averaged embeddings from diarization process (these are two short), and I suggest to use TitaNet stand alone model to generate speaker embedding from 5~10 seconds of audio from the final diarization results.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

NVIDIA / NeMo