WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Use custom feature vector instead of thin-resnet #55

Open clintonlau opened 4 years ago


Thanks for the comprehensive code and paper.

I am experimenting with feeding custom extracted features, such as MFCCs, into your speaker recognition model. I want to replace the thinResNet-34 stage so that MFCC feature vectors feed directly into the NetVLAD model. The MFCC matrix is similar in size to your input spectrogram (frequency channels on the first axis and time frames on the second axis, i.e. channels x T x 1). I see that the input to the NetVLAD model is a tensor of shape (1 x T/32 x 512). Do you think feeding feature vectors directly into the NetVLAD model is possible? If so, how should I modify the feature vectors to conform to the NetVLAD input?
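
To make the shape question concrete, here is a minimal NumPy sketch of one possible reshaping. It is not code from this repository. The function name `mfcc_to_vlad_input`, the `pool` parameter, and the random projection matrix are all hypothetical; the projection only stands in for a layer that would have to be learned (for example a Dense or 1x1 convolution layer) to lift each MFCC frame to 512 dimensions.

```python
import numpy as np

def mfcc_to_vlad_input(mfcc, out_dim=512, pool=1, seed=0):
    """Reshape an MFCC array of shape (n_coeffs, T, 1) to (1, T // pool, out_dim).

    Hypothetical helper for illustration only: the random matrix W is a
    placeholder for a trainable projection, not part of the original model.
    """
    n_coeffs, T, _ = mfcc.shape

    # Drop the trailing singleton axis and put time first: (T, n_coeffs).
    frames = mfcc[:, :, 0].T

    # Optionally average consecutive frames to shorten the time axis,
    # mimicking the T/32 downsampling mentioned above when pool=32.
    T_out = T // pool
    frames = frames[:T_out * pool].reshape(T_out, pool, n_coeffs).mean(axis=1)

    # Placeholder projection from n_coeffs to out_dim channels.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_coeffs, out_dim)) / np.sqrt(n_coeffs)
    projected = frames @ W  # (T_out, out_dim)

    # Add the leading singleton axis: (1, T_out, out_dim).
    return projected[np.newaxis, :, :]

# Example: 40 MFCC coefficients over 320 frames.
mfcc = np.zeros((40, 320, 1), dtype=np.float32)
features = mfcc_to_vlad_input(mfcc, pool=32)
print(features.shape)
```

With `pool=1` the time axis is left at full length T, which may also be acceptable if the aggregation layer handles variable-length input; that is an assumption I have not verified against the model code.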