NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speaker Diarization - Extracting speaker embeddings for labels from files with multiple speakers from MSDD models #6965

Closed asusdisciple closed 1 year ago

asusdisciple commented 1 year ago

Is your feature request related to a problem? Please describe.

One problem I often run into is that I can't verify already-known speakers across files, since inference with an MSDD model only yields labels. For example: file A contains the speakers Paul, Tom, and Charlie, and file B contains Bert, Charlie, and Johann. To determine whether Charlie appears in both files, I need his speaker embedding. At the moment I have to run speaker diarization on file A, get the timestamps for Charlie, cut the file at those marks, stitch the segments back together, and extract the corresponding embedding; I then repeat the same steps for file B and compare the two embeddings. For a production setup, however, I would need the embedding directly from the MSDD model, but as I understand it the MSDD model only produces a probability vector?
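A minimal sketch of the cut-and-stitch workaround described above, assuming the diarization output is a standard RTTM file and the audio is already loaded as a 1-D NumPy array. The helper names parse_rttm and stitch_speaker_audio are hypothetical, not part of NeMo, and the RTTM text and silent waveform are made-up placeholders.

```python
import numpy as np

def parse_rttm(rttm_text):
    """Return {speaker_label: [(start_sec, end_sec), ...]} from RTTM text."""
    segments = {}
    for line in rttm_text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM columns: type, file, channel, onset, duration, ..., speaker name at index 7
        start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        segments.setdefault(speaker, []).append((start, start + duration))
    return segments

def stitch_speaker_audio(waveform, sample_rate, spans):
    """Concatenate the samples that fall inside the given (start, end) spans."""
    pieces = [waveform[int(s * sample_rate):int(e * sample_rate)] for s, e in spans]
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=waveform.dtype)

# Placeholder data: three segments, two speakers, four seconds of silence.
rttm = """SPEAKER fileA 1 0.00 1.50 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER fileA 1 1.50 2.00 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER fileA 1 3.50 0.50 <NA> <NA> speaker_0 <NA> <NA>"""
sample_rate = 16000
waveform = np.zeros(4 * sample_rate, dtype=np.float32)

spans = parse_rttm(rttm)
charlie_audio = stitch_speaker_audio(waveform, sample_rate, spans["speaker_0"])
```

The stitched array would then be passed to a speaker embedding extractor, and the whole procedure repeated for file B, which is the overhead this request is trying to avoid.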

Describe the solution you'd like

A solution would be a method that extracts the feature embeddings after diarization and associates them with the labels. I found the get_cluster_avg_embs_model() method for MSDD models, but I am not sure how to use it or whether it yields the correct results (it's not clear to me which parameters to use, since the model holds a lot of internal state after diarization). So the question is: which parameter yields the best representation of a speaker for recognizing them again?

nithinraok commented 1 year ago

Are you looking to get a representative embedding for each cluster label? @tango4j, you did this for online diarization, right? Is it part of main?

tango4j commented 1 year ago

@asusdisciple Please take a look at this part of the code: MSDD forward function

clus_label_index contains the indices of the speakers that appear in the output RTTM, CTM, and JSON files (e.g. speaker_0, speaker_1). Use this speaker index to grab the average speaker embedding from the variable ms_avg_embs, which has shape (batch_size, scale_n, emb_dim, self.num_spks_per_model). For example, ms_avg_embs[0, :, :, spk_index] gives you the (scale_n, emb_dim) embeddings for spk_index. In this way, you can associate the speaker_<x> labels with the average speaker embeddings.

Or, put simply, the speaker index in the RTTM, CTM, and JSON files already corresponds to the speaker indices used within the NeMo speaker diarization system.

Please let us know if you still have trouble figuring out this issue.
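A minimal sketch of the indexing described above, using a random NumPy array as a stand-in for ms_avg_embs with the stated shape (batch_size, scale_n, emb_dim, num_spks). If the real variable is a PyTorch tensor, it could be converted first with .detach().cpu().numpy(). The helper names, the example dimensions, and the choice to flatten all scales into one vector before taking cosine similarity are assumptions for illustration, not something the MSDD code prescribes.

```python
import numpy as np

def speaker_embedding(ms_avg_embs, spk_index, batch_index=0):
    """Slice out one speaker: returns an array of shape (scale_n, emb_dim)."""
    return ms_avg_embs[batch_index, :, :, spk_index]

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings, flattened across scales."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in data: dimensions are arbitrary example values.
rng = np.random.default_rng(0)
batch_size, scale_n, emb_dim, num_spks = 1, 5, 192, 2
embs_a = rng.standard_normal((batch_size, scale_n, emb_dim, num_spks))
# Pretend file B holds the same two speakers in swapped order.
embs_b = embs_a[..., ::-1].copy()

# Compare speaker_0 of file A against every speaker of file B.
query = speaker_embedding(embs_a, spk_index=0)
scores = [cosine_similarity(query, speaker_embedding(embs_b, i)) for i in range(num_spks)]
best_match = int(np.argmax(scores))
```

With real embeddings the best score would need a threshold before being treated as a match, since a speaker from file A may not appear in file B at all.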

asusdisciple commented 1 year ago

I will try this. Thanks, this helped a lot.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.