Derpimort / VGGVox-PyTorch

Implementing VGGVox for Speaker Identification on VoxCeleb1 dataset in PyTorch.
MIT License

Using the model to recognize voice #2

giusarno opened this issue 3 years ago

giusarno commented 3 years ago

Hello, thanks for this project. I managed to get the test to work immediately. I have been trying to use the model as a feature extractor for d-vectors by reading the tensor value at fc7. The idea is to compare (via cosine distance) the d-vectors stored for a person during an enrollment process against the d-vector generated from the current recording. However, I tried this for a couple of wav samples and the cosine distance doesn't seem to go below 0.6 even for different people in the set, which suggests the model does not work well in this scenario. How is this model meant to be used? Would it work only with the voices in the VoxCeleb set?
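
For reference, this is roughly what my extraction looks like (a minimal sketch; the `model.fc7` attribute name and the surrounding wiring are my assumptions about how the layers are exposed, not this repo's actual API):

```python
import torch
import torch.nn.functional as F

def dvector(model, spec):
    # Capture the fc7 activation with a forward hook;
    # `model.fc7` is an assumed attribute name.
    feats = {}
    handle = model.fc7.register_forward_hook(
        lambda module, inp, out: feats.update(fc7=out.detach())
    )
    with torch.no_grad():
        model(spec)
    handle.remove()
    return feats["fc7"].flatten()

def cosine_distance(a, b):
    # 0.0 = same direction, 2.0 = opposite
    return 1.0 - F.cosine_similarity(a, b, dim=0).item()

# Enrollment: store dvector(model, enroll_spec) per speaker.
# Verification: cosine_distance(stored_vec, dvector(model, new_spec))
```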

Thanks in advance.

Derpimort commented 3 years ago

I can't say with 100% certainty whether this is due to the model architecture or the implementation.

Also, there are a few points to consider from the original paper:

For verification, feature vectors can be obtained from the classification network using the 1024 dimension fc7 vectors, and a cosine distance can be used to compare vectors. However, it is better to learn an embedding by training a Siamese network with a contrastive loss [38]. This is better suited to the verification task as the network learns to optimize similarity directly, rather than indirectly via a classification loss.

So, the authors' suggested approach, converting fc8 to a 1024-D embedding layer trained in a Siamese setup, is the better-performing one, but the simpler alternative of using the fc7 output directly also seems to give okay-ish results according to the paper.

I didn't go through the full paper again, and this repo only has the identification model. So my suggestion would be to train the model again and go for the Siamese-network approach for both models after training for classification. If that works, we have our answer; otherwise it might be a problem with the implementation.
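
The contrastive-loss part would look roughly like this (a sketch only; the margin value and the idea of swapping fc8 for a 1024-D embedding head follow the paper, not anything already in this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Contrastive loss: pull same-speaker pairs together,
    push different-speaker pairs at least `margin` apart."""
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, emb1, emb2, same_speaker):
        # same_speaker: 1.0 for positive pairs, 0.0 for negative pairs
        dist = F.pairwise_distance(emb1, emb2)
        pos = same_speaker * dist.pow(2)
        neg = (1 - same_speaker) * F.relu(self.margin - dist).pow(2)
        return (pos + neg).mean()

# Assumed wiring: reuse the classification backbone up to fc7, replace
# fc8 with a new 1024-D embedding layer (per the paper), and share the
# weights across both branches of each pair:
# loss = ContrastiveLoss()(model(spec_a), model(spec_b), labels)
```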

EDIT: Check out the reference repo vggvox-speaker-identification. It might be easier to just go with that instead of doing all the steps above.

Derpimort commented 3 years ago

Keeping it open. If I get some time in the near future, I'll try it out.

giusarno commented 3 years ago

Thank you for your reply. In the end I moved on and tried https://gitlab.fbk.eu/brutti/vggvox_features, which is essentially the PyTorch version of the one you suggested (it's referenced there on GitLab). I wanted to stick to PyTorch if possible. I got really good results with the VoxCeleb and TIMIT 16 datasets.

Thank you.

samyak0210 commented 3 years ago

Hello,

We are also facing a similar issue with the model. We passed another audio file to the preprocess function and extracted features from the self.features instance of the class. We followed the same procedure with test.wav, obtained two 4096-D vectors (one from the other audio and one from test.wav), and got 99.8% cosine similarity between them. We followed the preprocessing protocol from signal_utils.py. I have also attached the files I am using for this analysis: inference.zip
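
Roughly what our extraction does (a condensed sketch of the attached files; the `preprocess` name stands in for the actual signal_utils.py call):

```python
import torch
import torch.nn.functional as F

def extract_features(model, spec):
    # Run only the conv stack (the `self.features` module mentioned
    # above) on an already-preprocessed spectrogram tensor.
    with torch.no_grad():
        feats = model.features(spec)
    # Average over the time axis so clips of different lengths
    # yield fixed-size vectors before the cosine comparison.
    return feats.mean(dim=-1).flatten()

# spec_a = preprocess("test.wav")   # signal_utils.py pipeline
# spec_b = preprocess("other.wav")
# v1, v2 = extract_features(model, spec_a), extract_features(model, spec_b)
# print(F.cosine_similarity(v1, v2, dim=0))  # this is where we see ~0.998
```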

Can you confirm this behaviour? I have also attached the audio files on which I am testing.