Hi @astorfi, I have some questions about the input dataset.
According to the paper, the number of speakers in the development phase is 511, but how long is the input audio file per speaker?
Also, although there is a CMVN preprocessing function in input_feature.py, I'm not sure whether CMVN preprocessing is appropriate for the output of the speechpy.feature.lmfe function. Did you use CMVN preprocessing in the experiments of the paper?
Thank you for your work!!
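For context, a minimal sketch of the preprocessing being discussed, assuming a 16 kHz mono signal; `speechpy.feature.lmfe` and `speechpy.processing.cmvn` are actual speechpy calls, but the frame parameters here are illustrative assumptions, not necessarily the repo's settings:

```python
import numpy as np
import speechpy

# Assumptions: 16 kHz audio, 25 ms frames, 10 ms stride, 40 log-mel filters.
fs = 16000
signal = np.random.randn(int(0.81 * fs))  # stand-in for a real 0.81 s utterance

# Log-mel filterbank energies: one (num_frames, 40) matrix per utterance.
features = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                 frame_length=0.025, frame_stride=0.01,
                                 num_filters=40)
print(features.shape)  # roughly (80, 40); exact frame count depends on padding

# CMVN subtracts the per-coefficient mean over time (optionally normalizing
# variance), so it applies to any (frames, features) matrix, lmfe output included.
normalized = speechpy.processing.cmvn(features, variance_normalization=True)
```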
@ku2482 Thanks for your question.
Regarding your questions:
@astorfi Thanks for answering.
I think every 0.81-second audio file results in an (80, 40) feature, and you concatenate 20 of them to make a (20, 80, 40) feature for the development phase. Is that right? Also, I don't know how many (20, 80, 40) features per speaker you use in the paper. Do you use just one (20, 80, 40) feature per speaker, making the dataset shaped (511, 20, 80, 40)?
Anyway, I appreciate your work and kindness.
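As a shape check on that description, a small numpy sketch (the 511 speakers, 20 utterances, and (80, 40) per-utterance features come from the thread; the arrays here are placeholders):

```python
import numpy as np

num_speakers = 511
num_utterances = 20      # utterances stacked into one feature cube
frames, coeffs = 80, 40  # per-utterance lmfe output shape, per the thread

# One (20, 80, 40) cube: stack twenty per-utterance (80, 40) matrices.
utterance_features = [np.zeros((frames, coeffs)) for _ in range(num_utterances)]
cube = np.stack(utterance_features)        # -> (20, 80, 40)

# One cube per speaker would give the (511, 20, 80, 40) dataset asked about.
dataset = np.stack([cube] * num_speakers)  # -> (511, 20, 80, 40)
print(cube.shape, dataset.shape)
```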
@ku2482 Yes, that's quite correct.
For the second part, (20, 80, 40) features are fed to the network. "20" is the number of spoken utterances for the speaker. However, there is no restriction on the number of (20, 80, 40) features for any speaker. The rule of thumb is "the more, the better" for background model generation. You can use 20 spoken utterances chosen at random for data augmentation (although they all need to belong to the same speaker).
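A hedged sketch of that augmentation idea: draw several random 20-utterance subsets per speaker, each yielding its own (20, 80, 40) cube. `utterances_per_speaker` and `cubes_per_speaker` are illustrative values, not numbers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
frames, coeffs = 80, 40
utterances_per_speaker = 100  # illustrative: total (80, 40) features one speaker has
cubes_per_speaker = 5         # illustrative: "no restriction" per the answer above

# Stand-in for one speaker's extracted lmfe features.
speaker_features = rng.standard_normal((utterances_per_speaker, frames, coeffs))

cubes = []
for _ in range(cubes_per_speaker):
    # Pick 20 utterances at random -- all from the same speaker.
    idx = rng.choice(utterances_per_speaker, size=20, replace=False)
    cubes.append(speaker_features[idx])  # each cube: (20, 80, 40)

cubes = np.stack(cubes)  # (5, 20, 80, 40) worth of training cubes for one speaker
print(cubes.shape)
```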
@astorfi Thank you so much!!
All of my questions are actually solved now, and I can understand your script. Your work is really great!!
I'll close this issue. Again, thank you!!