Hi, I have some questions (this is my graduation project, so this is extremely important to me).
- The output of the embedder-net() function is [N, 256]. What exactly is N? Is it the number of sliding windows (240 ms each)?
- Can we use this output (the output of embedder-net()) for speaker diarization, i.e. can we apply clustering algorithms to these sequences?
- How did you build train-sequence and train-cluster-id (the inputs to UIS-RNN)? My dataset is different from the TIMIT corpus (TIMIT is a speaker recognition dataset, not a speaker diarization dataset).
This is a link to the corpus I am using: https://github.com/EMRAI/emrai-synthetic-diarization-corpus
Thank you in advance for your help.
(Answering #41) align_embeddings averages the window-level embeddings into segment-level d-vectors.
Since this is for your graduation project, I recommend reading https://arxiv.org/pdf/1810.04719.pdf, particularly Section 2. The dvector_create script just follows what is described in that section.
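To make the averaging step concrete, here is a minimal sketch of turning window-level embeddings into segment-level d-vectors. It is not the code from dvector_create: the function name `average_windows_into_segments`, the `windows_per_segment` parameter, and the fixed-size grouping are assumptions made for illustration (the real script may decide segment boundaries differently), and the input is random data standing in for an [N, 256] embedder output.

```python
import numpy as np


def average_windows_into_segments(window_embeddings, windows_per_segment):
    """Average consecutive window-level embeddings into segment-level vectors.

    Hypothetical helper: groups every `windows_per_segment` consecutive rows,
    which is a simplification of how segment boundaries might be chosen.
    """
    window_embeddings = np.asarray(window_embeddings, dtype=np.float64)

    # L2-normalize each window embedding so every window contributes equally.
    norms = np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    normalized = window_embeddings / np.maximum(norms, 1e-12)

    segments = []
    for start in range(0, len(normalized), windows_per_segment):
        chunk = normalized[start:start + windows_per_segment]
        mean = chunk.mean(axis=0)
        # Re-normalize the averaged vector to unit length.
        segments.append(mean / max(np.linalg.norm(mean), 1e-12))
    return np.stack(segments)


# Stand-in for an [N, 256] embedder output with N = 10 windows.
rng = np.random.default_rng(0)
window_embeddings = rng.normal(size=(10, 256))

segment_dvectors = average_windows_into_segments(
    window_embeddings, windows_per_segment=4
)
print(segment_dvectors.shape)  # (3, 256): groups of 4, 4, and 2 windows
```

Under these assumptions, each row of the result would be one segment-level d-vector, and a sequence of such rows is the kind of input a clustering algorithm would consume.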