Ask for help about the net_vocal

Hello, I observed the effect of net_vocal_attributes in the whole model framework.

At present, the embedding extracted from the predicted sound, the distance of the negative sample pair (audio_embedding_A1_pred and audio_embedding_B1_pred) can reach 2, and the distance of the positive sample pair (audio_embedding_A1_pred and audio_embedding_A2_pred) can reach about 0.

But after I changed the input of net_vocal to pure real sound, the distance between negative sample pairs (audio_embedding_A1_gt and audio_embedding_B_gt) can only reach 1. That is to say, the sound feature extraction is not good when I train the net_vocal alone.

It stands to reason that pure ground voices are easier to extract features than predicted voices. I modified the parameters of the training (batch, learning rate, etc.) but none solved the problem. May I know what is the reason?

Looking forward to your reply！

facebookresearch / VisualVoice

Ask for help about the net_vocal #24