Open venkatesh-1729 opened 7 years ago
Hi @venkatesh-1729
Hi @cyrta, thanks for the reply.
We aggregate the sigmoid outputs by summing all outputs class-wise over the whole audio excerpt to obtain a total amount of activation for each entry, and then normalize the values by dividing them by the maximum value among classes. The analysis of those embeddings over time allows the system to detect speaker changes and identify newly appearing speakers by comparing the extracted and normalized embedding with those previously seen. If the cosine similarity metric between the embeddings is higher than a threshold, fixed at 0.4 after a set of preliminary experiments, the speaker is considered new. Otherwise, we map its identity to the one corresponding to the nearest embedding.
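A minimal sketch of the procedure described above, assuming per-frame sigmoid outputs of shape `(n_frames, n_classes)`; the function names and shapes are my own illustration, not from the paper. Note the comparison direction follows the quoted wording literally (similarity above 0.4 means a new speaker); if the paper actually meant cosine *distance*, the comparison would be flipped.

```python
import numpy as np

def extract_embedding(sigmoid_outputs):
    """Aggregate frame-level sigmoid outputs into one embedding.

    sigmoid_outputs: (n_frames, n_classes) array of per-frame
    sigmoid activations over the audio excerpt (assumed shape).
    """
    # Sum the activations class-wise over the whole excerpt ...
    totals = sigmoid_outputs.sum(axis=0)
    # ... then normalize by the maximum value among classes.
    return totals / totals.max()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_speaker(embedding, known_embeddings, threshold=0.4):
    """Return the index of the nearest known speaker, or None for new.

    Per the quoted text: similarity above the threshold => new speaker;
    otherwise map to the nearest previously seen embedding.
    """
    if not known_embeddings:
        return None  # no speakers seen yet: first speaker is new
    sims = [cosine_similarity(embedding, e) for e in known_embeddings]
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        return None  # considered a new speaker
    return best
```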
Also, what is the input shape to the network (or the shape of the input STFT)? These details are not present in the paper. If possible, can you share a Keras model summary, which would clear up the confusion?
Regards.
Hi @cyrta, can you please elaborate on this paragraph in the paper? This is my understanding; please correct me if I am wrong:
"we use activations from the last layer of neural network as speaker embeddings". This is confusing, because the last layer would be a softmax layer according to the loss function of the network. Or did you mean that there is a dense layer with sigmoid activation before the softmax layer, and its activations are used as the speaker embeddings? What is the size of the embeddings that are being extracted? Thanks.