astorfi / 3D-convolutional-speaker-recognition

:speaker: Deep Learning & 3D Convolutional Neural Networks for Speaker Verification
Apache License 2.0

What is the exact meaning of "utterances"? #46

Closed Chegde8 closed 5 years ago

Chegde8 commented 5 years ago

The term "utterances" is not defined anywhere in the paper. I am new to the field of speaker recognition. Can someone tell me what "utterances" means in the context of this project?

Thanks in advance

MSAlghamdi commented 5 years ago

An utterance is a segment of speech covering a chosen unit of time.

Chegde8 commented 5 years ago

@MSAlghamdi, thank you! In the code, does this correspond to the number of wav files for each speaker?

MSAlghamdi commented 5 years ago

@Chegde8, after some signal processing steps, the code takes each wav file and divides it into chunks called utterances. As I understand the paper (if I'm not mistaken), each utterance is an 800 ms sound sample. The utterance is split into overlapping frames, one every 10 ms, and each frame has 40 MFEC features, so a single utterance becomes a matrix of 80 frames × 40 features. As the paper states: "Each input feature map has the dimensionality of ζ × 80 × 40 which is formed from 80 input frames and their corresponding spectral features, where ζ is the number of utterances used in modeling the speaker during the development and enrollment stages. By default we set ζ = 20." So ζ is the depth of the 3D input.
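
To make the arithmetic concrete, here is a rough sketch of how those numbers combine into the ζ × 80 × 40 input. All names are made up for illustration and are not taken from the repository's code. It also assumes exactly one frame per 10 ms stride and ignores any edge effects from the frame window length.

```python
# Hypothetical names; the numbers come from the discussion above.
UTTERANCE_MS = 800      # duration of one utterance
FRAME_STRIDE_MS = 10    # a new (overlapping) frame starts every 10 ms
NUM_FEATURES = 40       # MFEC features per frame
NUM_UTTERANCES = 20     # zeta, the paper's default depth


def frames_per_utterance(utterance_ms=UTTERANCE_MS, stride_ms=FRAME_STRIDE_MS):
    # Simplifying assumption: one frame per stride, no window-length edge effects.
    return utterance_ms // stride_ms


def input_shape(num_utterances=NUM_UTTERANCES):
    # (depth, frames, features) = (zeta, 80, 40) with the defaults.
    return (num_utterances, frames_per_utterance(), NUM_FEATURES)


def dummy_input(num_utterances=NUM_UTTERANCES):
    # Zero-filled placeholder with the same nesting as the 3D input:
    # one 80 x 40 feature matrix per utterance, stacked along the depth axis.
    frames = frames_per_utterance()
    return [
        [[0.0] * NUM_FEATURES for _ in range(frames)]
        for _ in range(num_utterances)
    ]


if __name__ == "__main__":
    cube = dummy_input()
    print(input_shape())
    print(len(cube), len(cube[0]), len(cube[0][0]))
```

Changing `num_utterances` only changes the depth; each utterance still contributes one 80 × 40 slice.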

Chegde8 commented 5 years ago

Got it. Thank you!