Closed asusdisciple closed 1 year ago
Are you looking to get representative embedding per each cluster label? @tango4j you did this for online diarization right? Is it part of main?
@asusdisciple
Please take a look at this part of the code: the MSDD forward function.
clus_label_index contains the indices of the speakers that appear in the output RTTM, CTM, and JSON files (e.g. speaker_0, speaker_1).
Now, use this speaker index to grab the average speaker embedding from the variable ms_avg_embs, whose shape is (batch_size, scale_n, emb_dim, self.num_spks_per_model).
For example, ms_avg_embs[0, :, :, spk_index] gives you a (scale_n, emb_dim) tensor of embeddings for spk_index.
In this way, you can associate each speaker_<x> label with its average speaker embedding.
Put another way, the speaker indices in the RTTM, CTM, and JSON files are the same indices used throughout the NeMo speaker diarization pipeline.
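The indexing above can be sketched as follows. The shapes here are illustrative placeholders (the real scale_n and emb_dim come from your diarizer configuration), and the random tensor just stands in for the ms_avg_embs produced inside the MSDD forward pass:

```python
import torch

# Illustrative shapes, not taken from a real checkpoint:
batch_size, scale_n, emb_dim, num_spks_per_model = 1, 5, 192, 2

# Stand-in for ms_avg_embs as described above:
# (batch_size, scale_n, emb_dim, num_spks_per_model)
ms_avg_embs = torch.randn(batch_size, scale_n, emb_dim, num_spks_per_model)

spk_index = 0  # the index behind the "speaker_0" label
avg_emb = ms_avg_embs[0, :, :, spk_index]  # -> (scale_n, emb_dim)

# One simple way (an assumption, not the only choice) to collapse the
# multi-scale axis into a single reference vector per speaker:
ref_emb = avg_emb.mean(dim=0)  # -> (emb_dim,)
```

Whether you keep the per-scale embeddings or pool them depends on how you plan to compare speakers downstream.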
Please let us know if you still have trouble figuring out this issue.
I will try this. Thanks, this helped a lot.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Is your feature request related to a problem? Please describe.
One problem I often come across is that I can't verify already-known speakers from other files, since inference with the MSDD approach only yields anonymous labels. For example: I have file A with speakers Paul, Tom, and Charlie, and file B with speakers Bert, Charlie, and Johann. To determine whether Charlie is in both file A and file B, I have to get his speaker embedding. At the moment I have to run speaker diarization on file A, get the timestamps for Charlie, cut the file at those marks, stitch it back together, and extract the corresponding feature embedding, which I can then compare against file B by doing the same there. However, for a production setup I would need the embedding directly from the MSDD model, but from my understanding the MSDD model only produces a probability vector?
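The comparison step described above (matching a known speaker's embedding against the speakers found in another file) can be sketched with plain cosine similarity. The embeddings here are random stand-ins; in practice they would be the per-speaker embeddings extracted after diarizing each file, and the 192-dim size is just an illustrative assumption:

```python
import torch
import torch.nn.functional as F

# Hypothetical reference embedding for Charlie, extracted from file A:
emb_charlie_a = torch.randn(192)

# Hypothetical embeddings for the speakers found in file B
# (e.g. Bert, Charlie, Johann), one row per speaker:
embs_file_b = torch.randn(3, 192)

# Cosine similarity between Charlie and every speaker in file B:
sims = F.cosine_similarity(emb_charlie_a.unsqueeze(0), embs_file_b, dim=1)

# The closest speaker in file B; a real system would also apply a
# verification threshold before declaring a match:
best = int(torch.argmax(sims))
```

With random vectors the result is meaningless, but with real diarization embeddings the highest-similarity row (above a tuned threshold) would identify Charlie in file B without the cut-and-stitch workaround.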
Describe the solution you'd like
A solution would be a method to extract the feature embeddings after diarization and associate them with the labels. I found the get_cluster_avg_embs_model() method for MSDD models, but I am not sure how to use it or whether it yields the correct results (it's not clear to me which parameters to use here, since there are a lot of internal states in the model after diarization). So the question would be: which parameter yields the best representation of a speaker so he can be recognized again?