Chegde8 closed this 5 years ago
speech per chosen unit of time
@MSAlghamdi, thank you! In the code, is this the number of wav files for each speaker?
@Chegde8, after some signal-processing steps, the code takes the wav file and divides it into chunks called utterances. As I understand it from the paper (if I'm not mistaken), each utterance is an 800 ms sound sample. Each 10 ms of the utterance (one overlapping frame) has 40 MFEC features, which gives 80 frames × 40 features per utterance. As the paper states: "Each input feature map has the dimensionality of ζ × 80 × 40 which is formed from 80 input frames and their corresponding spectral features, where ζ is the number of utterances used in modeling the speaker during the development and enrollment stages. By default we set ζ = 20." So ζ is the depth of the 3D input feature map.
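To make the shapes concrete, here is a rough NumPy sketch of how a ζ × 80 × 40 cube could be assembled. It is not the repository's actual code: the function name `build_feature_cube` is hypothetical, the feature matrix is random stand-in data rather than real MFEC output, and cutting consecutive, non-overlapping 80-frame windows is my own assumption about how utterances are picked.

```python
import numpy as np

FRAMES_PER_UTTERANCE = 80  # 80 frames x 10 ms = 800 ms per utterance
FEATURES_PER_FRAME = 40    # 40 MFEC features per frame
NUM_UTTERANCES = 20        # zeta, the default quoted from the paper


def build_feature_cube(features,
                       num_utterances=NUM_UTTERANCES,
                       frames_per_utterance=FRAMES_PER_UTTERANCE):
    """Stack consecutive windows of frames into a (zeta, 80, 40) cube.

    `features` is a 2D array of shape (num_frames, num_features).
    Hypothetical helper: consecutive, non-overlapping windows are an
    assumption, not necessarily what the project does.
    """
    needed = num_utterances * frames_per_utterance
    if features.shape[0] < needed:
        raise ValueError(
            f"need at least {needed} frames, got {features.shape[0]}")
    # Keep only the frames we need, then split them into utterances.
    return features[:needed].reshape(
        num_utterances, frames_per_utterance, features.shape[1])


# Stand-in for real MFEC features: 2000 frames x 40 features.
rng = np.random.default_rng(0)
fake_features = rng.standard_normal((2000, FEATURES_PER_FRAME))

cube = build_feature_cube(fake_features)
print(cube.shape)
```

Each slice `cube[i]` is one utterance (80 frames × 40 features), and the first axis is the ζ depth mentioned in the quote.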
Got it. Thank you!
The term "utterance" is not defined anywhere in the paper. I am new to the field of speaker recognition. Can someone tell me what "utterances" means in the context of this project?
Thanks in advance