Thanks for the comprehensive code and paper.
I am experimenting with combining custom extracted features, such as MFCCs, with your speaker recognition model. I want to replace the thinResNet-34 stage with MFCC feature vectors that feed directly into the NetVLAD model. An MFCC matrix has a layout similar to your input spectrogram (frequency channels on the first axis and time frames on the second axis, i.e. channels x T x 1). I see that the input to the NetVLAD model is a tensor of size (1 x T/32 x 512). Do you think feeding feature vectors directly into the NetVLAD model is possible? If so, how should I modify the feature vectors to conform to the input the NetVLAD model expects?
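To make the question concrete, here is a rough sketch of the reshaping I have in mind, written in plain NumPy rather than against your code. Everything in it is an assumption on my side: the helper name `mfcc_to_vlad_input` is hypothetical, the random projection matrix only stands in for a layer that would have to be learned (e.g. a dense or 1x1 convolution layer), and I am assuming the NetVLAD stage can pool over T frames instead of T/32.

```python
import numpy as np


def mfcc_to_vlad_input(mfcc, out_dim=512, seed=0):
    """Map an MFCC matrix of shape (n_mfcc, T) to a tensor of shape (1, T, out_dim).

    Hypothetical helper: the projection weights are random placeholders
    for a layer that would be trained together with the NetVLAD stage.
    """
    n_mfcc, num_frames = mfcc.shape
    rng = np.random.default_rng(seed)
    # Placeholder for learned weights mapping n_mfcc coefficients to out_dim channels.
    weights = rng.standard_normal((n_mfcc, out_dim)) / np.sqrt(n_mfcc)
    frames = mfcc.T                 # (T, n_mfcc): one feature vector per time frame
    projected = frames @ weights    # (T, out_dim)
    # Add the singleton axis that replaces the collapsed frequency axis.
    return projected[np.newaxis, :, :]


# Example with 40 MFCC coefficients over 250 frames (made-up sizes).
mfcc = np.random.default_rng(1).standard_normal((40, 250))
features = mfcc_to_vlad_input(mfcc)
print(features.shape)
```

The batch dimension is left out here, so the result corresponds to a single utterance. If the time axis really has to be T/32, some temporal pooling or striding would be needed before this step, which the sketch does not attempt.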