An alternative approach to the speaker encoder

ghost commented 4 years ago

For the encoder, i have a question... if i well understand, the target is just to maximize similarity of 2 audios of the same speaker and then minimize distance between them So, we could imagin to use another approach to train it no ? Based on the « voicemap » project i made a simple siamese network whose target is to minimize distance between 2 audios of n seconds (i tried 2 and 3) and have really good results too (i have 88-90% with binary accuracy) with only 2 or 3 hours of training on my GPU ! The process is really simple : 2 inputs (2 sec of raw audio) pass to a same encoder network then the 2 embedded (here 64 dims vectors) pass to an euclidian distance layer and then to a 1 neuron linear with sigmoid (which gives the probability that the 2 audios are of the same speaker) Here i used same length audio but i suppose 2 audios of different length can be good too and the model is only CNN so much faster and easier to train than the actual 3-layer RNN... Here is the tutorial link with code of the original voicemap project, really interesting and many fun applications i made with it https://medium.com/analytics-vidhya/building-a-speaker-identification-system-from-scratch-with-deep-learning-f4c4aa558a56

Now i plan to convert the encoder of this repo and see his loss and try to compare it with my encoer loss to see if results are similar or not (because i don’t know how to use binary accuracy with this encoder)

Originally posted by @Ananas120 in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/447#issuecomment-672644774

Ananas120 commented 4 years ago

Here are the results of my siamese encoder (with embedding size = 64 and 2 seconds of raw 16kHz audio as input) The embedding is made with 10 audios of 20 random speaker of CommonVoice (fr) dataset with the UMAP projection (the projection code of this repo) I also plot the training metrics (loss / metrics) over batch and steps

Like you can see, i trained it only for 5 epochs (10k steps) that takes me less than 1 hour on my single GPU and the result is quite good !

embedding_plot

Another point to note is that, in the original model, i don't normalize my embedding (i make it only for the plot) so perhaps it can affect the plot results ? I will retrain a model with normalization at the end of the embedding to see if it can improve results The architecture i used is exactly the same as in the voicemap article except that i added step of 4 for the first convolution because i put 16khz audio as input and not 4khz (so step of 4 reduces 16khz to the equivalent of 4 khz in term of samples)

Note : all values in the plot are for training set (because of a bug in my code, it doesn’t keep track of validation metrics) but they are really similar

ghost commented 4 years ago

My first reaction to the concept is that the approach with siamese networks is not good enough if it only achieves 90% accuracy. But after thinking some more, a weakness of the current speaker encoder is that we train it to treat each voice as distinct even though perceptually, we consider certain pairs of voices more similar than others. So maybe less accuracy is actually a good thing if it relates unseen voices in a meaningful way to the voices that the encoder is trained on.

Also, for the encoder, a higher speaker embedding size may work better for speaker ID because it has more features to discriminate between voices. But it may perform poorly for voice cloning if the encoder relies on features that humans cannot perceive. Restrict the embedding size may force the encoder to use the more tangible features. It may also help to label highly similar voices (as perceived by human) as the same voice for training.

Ananas120 commented 4 years ago

Another thing to note is that the siamese network is train to say if 2 samples of voice are from the same speaker or not and to do that it uses the euclidian distance between the embedded audios and this is a simple linear layer with sigmoid that gives the score (0 if 2 samples are same, if the target is to minimize distance or 1 if same if target is to maximize similarity, i suppose at the end it’s exactly the same for the encoder part) But then the accuracy is computed as the mean between true-positives and true-negatives (actually i have 92% with my training model)

The problem with the encoder of this repo (i find) is that we can’t get this think of metrics to evaluate the model so i can’t really compare them... The only thing i find is to see the GE2E loss of the encoder of the siamese network and as i tested it, my encoder has 0.8 loss and RNN encoder has also around 0.7 loss so... i think the 2 approachs can be interchanged but the main advantage of the siamese approach is that no processing is needed (raw audio is given) and we have meaningful metrics (accuracy, true-positive and true-negative) and it can be train with a simple binary_crossentropy loss

A second advantage is for the applications of the model itself, the encoder just embeds mels so if we want to say if 2 samples comes from same speaker, we must implement the distance threshold etc, with the siamese, the output layer says it for us and it can then have many fun applications !

ghost commented 4 years ago

the main advantage of the siamese approach is that no processing is needed (raw audio is given)

How do you handle audio files with different sample rates? I think encoder preprocessing for this repo just resamples the source audio to the desired rate. In other words I don't see the relative advantage to the GE2E approach here? My ignorance may be showing so please explain further.

Ananas120 commented 4 years ago

Oh yes indeed, resampling is needed but no mel spectrogram computation i would say (this repo uses à 40-features mel-spectrogram as input so processing is resampling and melspectrogram computation) Another thing is that i didn’t try variable sample length in the siamese (the 2 samples are fixed length) but i will launch a training with variable length when my actual training finishes

Ananas120 commented 4 years ago

Oh @blue-fish another funny thing you can try with embedding vector as the synthesizer ! As a fun application of the siamese, i made scripts to detect different speakers in an audio file and cluster them (to detect where they speach) I launched it on a radio interview with 2 people and the model returns me 3 speakers... I listened to samples of the 3 detected and, in fact, the first detected speaker was the introduction music and the 2 other were the true speakers perfectly separated !

So... what if the embedded input is a music or something else than a real human voice ??!

ghost commented 4 years ago

So... what if the embedded input is a music or something else than a real human voice ??!

You can try this with the pretrained models of this toolbox. Either load a non-voice sample, or record something with your microphone. Then try to synthesize some text. You still get intelligible speech as output because the synthesizer is only trained to make speech.

Ananas120 commented 4 years ago

Good, i haven’t tested the toolbox yet in fact but good to know, i think i will test it to see which impact it can have on the voice

Ananas120 commented 4 years ago

So the results for my tests on siamese and the actual encoder : 3-layer RNN encoder (256-embedding) :

Low processing time (melspectrogram)
Bad implementation of GE2E loss (memory efficient) (my implementation but no tf2 open-source available)
GE2E loss of 0.6 (if my results are ok but not sure...)
Can perform arbitrary length sample

Siamese encoder (64-embedding) :

No audio processing (except resampling if needed)
Very fast to train (200ms for batch size 32 on my single GPU with audios pre-loaded)
Have meaningful metrics to evaluate it (binary_accuracy, true_positives and true_negatives)
BCE loss 0.09 with 94% accuracy (and can improve it more i think !)
Theorically works with arbitrary length input but i don’t know why, it doesn’t converge (so for my best model i use 3 seconds of raw 16kHz audio)
GE2E loss of 0.8 (for my old model of 89% accuracy)

In two cases, the embedding plot seems good (but slightly better for the siamese i find)

Plan :

Adapt my Tacotron-2 arch to accept speaker embedding as input
Create speaker embedding for all speakers in my 2 Fr datasets (Common Voice and SIWIS) (with my Siamese encoder)
Create input pipeline for Tacotron-2 (including the embedding)
Transfert weights of my Fr pretrained Tacotron-2 and train it for a few days (with the speaker embedding)
Create a complete pipeline to generate audio with arbitrary voice (based on an input audio (3sec))

I just have a question, my pretrained Tacotron is pretrained for 22050Hz and my Waveglow vocoder too but my encoder uses 16kHz audio... do you think it can be a problem to use embedding from 16kHz to train the synthesizer on 22050Hz or no ? Therically i think no because this is just a speaker embedding so an abstract representation of the speaker but... not sure

Another think, my encoder is 64-length embedding, is this ok as input or is it too small ?

ghost commented 4 years ago

I just have a question, my pretrained Tacotron is pretrained for 22050Hz and my Waveglow vocoder too but my encoder uses 16kHz audio... do you think it can be a problem to use embedding from 16kHz to train the synthesizer on 22050Hz or no ? Therically i think no because this is just a speaker embedding so an abstract representation of the speaker but... not sure

Another think, my encoder is 64-length embedding, is this ok as input or is it too small ?

I agree a 16,000 Hz encoder should work with a 22,050 Hz synth/vocoder. As you say, it is just an abstract representation of the voice and there is no information being passed that depends on sample rate.

From 1806.04558 (the SV2TTS paper), it looks like the output similarity may be penalized slightly but they still achieve a MOS of 3+ with an embedding size of 64. So I think 64 is sufficient unless you have a large number of speakers in your synthesizer training dataset (1,000+). I am aware that the speaker counts in the table are based on the encoder training set... but if you do not have enough speakers for the synth, I hypothesize it cannot effectively utilize the additional dimensions in the encoder output.

Ananas120 commented 4 years ago

Update : 97% accuracy with 0.02 BCE val-loss for the best model (which i used for the embedding plot) training

embedding_plot

mueller91 commented 4 years ago

Interesting approach! Some thoughts that popped into my head:

It is not clear to my why the siamese model should require less preprocessing. I would feed it mel-specs, just as the speaker encoder in this repo. What would the advantage be of feeding in raw audio?
The Encoder (SE) in this repo is very similar to a siamese network (both take an input and produce a latent representation), except that the SE does not use L2 distance to compute the loss, but cosine distance; and the SE does not compute it for only a single pair, but for 640 audios in a pairwise fashion.
L2 suffers from the curse of dimensionality, whereas cos-sim does not; thus, i'd favour the SE.

In general, i'm having a hard time to see what the siamese network should produce better embeddings? Could you elaborate on this? What am i missing? And as for better metrics, you can just measure the cosine similarity within all pairwise-similar audios (and pairwise dissimilar audios) and get an estimate how close / far these pairs are mapped on the hypersphere by the SE.

Ananas120 commented 4 years ago

@mueller91 in the approach of siamese in the paper i follow, he passes raw audio as input of the model and then less processing is needed compared to this repo (where you have to process the spectrogram)

The other point is that i don’t use the L2 loss but a BinaryCrossentropy loss on the final layer of the siamese because the siamese has an encoder and a « decoder » part : the decoder takes the 2 embeddings, calculate distance (i use euclidian) and after that, a 1-neuron dense layer with sigmoid gives the distance (between 0 and 1) or the probability that they are same (also between 0 and 1) and then you can just use a BCE loss on that decision

Another drawback of my approach is that i can’t use unfixed length audio samples (i don’t understand why, maybe not enough training)

Yes the GE2E loss with cosine similarity is a good approach too and it can be interesting to compare them with real experiments (same models, training set, metrics, ...) For me i use the siamese because i find it fun, more expressive and my GE2E loss is not very efficient (very memory efficient) and then training is much longer

mueller91 commented 4 years ago

Thank you so much for your answer!

It would be interesting to see if a synthesizer conditioned on your siamese embeddings can produce better audios. Are planning on trying this any time soon?
you're still computing euclidean distance (L2 distance) which is subject to the curse of dimensionality. Still, it might still work for n=64 or even n=256.
you probably cannot use variable length audios because your architecture does not contain a RNN / LSTM / GRU (recurrent) component. (Sidenode: These tend not to work with raw audio because the temporal dependencies are too large) I'm curious to learn more about the results from your experiments.

Ananas120 commented 4 years ago

I am training a Tacotron-2 model with my embeddings right now (you can see the other issue « pytorch synthesizer » in this repo where i give some results / ...)

But i am training a tacotron-2 which is not exactly the same as the model of this repo and training with embeddings of size 64 only so i don’t know if it’s not working (actually) because the model is (a little) different or because the embedding is too small or just because it will never work with my siamese embeddings

I have no RNN but i have a GlobalMaxPooling layer so it reduces the dimension to be usable and normally the maximum of a speaker should be similar in whole audios (because it should be similar between different audios) so i don’t really understand why it doesn’t work Because if the GlobalMaxPool is similar between 2 samples, it should be similar for an audio of arbitrary length so... it should work in theory no ?

Ananas120 commented 3 years ago

To see progress on this approach and results of Tacotron-2 using this siamese encoder, you can see the #507 where i describe my training results and procedure with a tf2.0 implementation of the synthesizer (slightly different from the Tacotron of this repo)

ghost commented 3 years ago

Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/507#issuecomment-706963626

Ananas120 commented 3 years ago

If you continue your experiments and achieve good performance (or just interesting inference and not only noise like for me) let me know please it can maybe help me in the tensorflow implementation !

CorentinJ / Real-Time-Voice-Cloning

An alternative approach to the speaker encoder #484