HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

What is the duration of audio of each D vector embedding that is created? #66

Open abhilashnayak opened 4 years ago

abhilashnayak commented 4 years ago

Hi,

Thanks for this work. I am using the output of dvector_create.py as input to uis-rnn, and diarization works as well.

But I am a little confused about the number of d-vector embeddings created. dvector_create.py produced 24 embeddings for a 9.7-second audio file and 21 embeddings for an 8.9-second one. In the first case, if I assume every embedding corresponds to 240 milliseconds of audio (just a guess) and add them up, the total does not match the full audio duration: 24 × 240 ms = 5760 ms (5.76 seconds), but my audio file is 9.7 seconds long.
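One possible explanation for the mismatch (a sketch, not a statement about what dvector_create.py actually does): if embeddings come from overlapping sliding windows with window length `window_s` and hop `hop_s` (both hypothetical values here), then N embeddings cover roughly `hop_s * (N - 1) + window_s` seconds, not `N * window_s`; and any silence dropped by a VAD step would not be covered at all.

```python
def covered_duration(n_embeddings, window_s=0.24, hop_s=0.12):
    """Duration (in seconds) spanned by n overlapping analysis windows.

    window_s and hop_s are ASSUMED placeholder values, not the
    parameters dvector_create.py actually uses.
    """
    if n_embeddings == 0:
        return 0.0
    # Consecutive windows advance by hop_s; the last window adds its full length.
    return hop_s * (n_embeddings - 1) + window_s
```

Under these assumed numbers, 24 embeddings would span 0.12 × 23 + 0.24 = 3.0 s, which again differs from 9.7 s, so the real window/hop (and any removed silence) would need to be read off from the script's config.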

I just wanted to understand this because I need to split the audio after diarization is done. The idea is: if the diarization result says the first 10 embeddings belong to speaker 1, and I also know each embedding is X ms long, then speaker 1's segment is 10 × X ms (10X/1000 seconds) long, so I can split the audio at that point, and so on. Without knowing the time range (in milliseconds) during which speaker 1 spoke, and likewise for speaker 2, I cannot split the audio.

Please help me understand this. Also, is there any other way you can suggest to split the audio?