b04901014 / UUVC

Official implementation for the paper: A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units.

Test time speaker adaptation #6


vishalbhavani commented 1 year ago

I tried the pretrained model and the one-shot VC results are not good. Is there a way to do test-time speaker adaptation (TSA), like in NANSY, on a few examples of the speaker to get a better speaker identity representation?

b04901014 commented 1 year ago

Hi, good question!

We didn't focus much on this, but we can apply the exact same TSA algorithm as in NANSY to the speaker conversion model: view the test-time utterance as new training data and fine-tune the model on it for a few steps (optionally freezing some parameters). However, judging from the samples, I assume you are referring to the poor accent transferability, and I would like some clarification before advising you to try TSA.

There are some limitations on transferring accent when using the pretrained models. I personally believe the causes are:

  1. Most of our training utterances are native spoken English.
  2. The inherent nature of SSL features (different accents may be represented by different discrete units).

TSA should help mitigate (1). However, it will not help with (2) if the accent is already encoded in the SSL features. In that case, converting accent would require converting the source SSL features, which our model (or NANSY) cannot do. In terms of accent transferability, I think ASR-TTS cascade VC systems still outperform SSL-based approaches like our model and NANSY.

As a simple piece of evidence, you can listen to the samples of NANSY++ (a recent, improved version of NANSY). In the third sample under Zero-Shot Voice Conversion, their model also outputs native-accent English despite the heavily accented target speech.

I think current VC systems utilizing SSL features are still limited to converting speaker F0 and timbre and cannot transfer accent well. An interesting direction would be improving accent conversion by working on the discrete units (e.g., using fewer clusters to force the units to encode less diverse realizations of phonemes, or conditioning duration prediction on the speaker embedding, since accent is closely related to timing).

vishalbhavani commented 1 year ago

Thanks for the clarification. I was more concerned about the identity mismatch than the accent difference. The voice texture of boman and prosenjit is not preserved in the generated audio. I tried other source audios, but the output texture didn't change. I presumed the problem might be in the speaker representation extracted from the target audio, hence the question about TSA.

Also, is the voice texture different because the target audio is accented?

b04901014 commented 1 year ago

Got it. In that case, I agree with you: it is likely an issue with the speaker embedding extractor, and TSA should help.

It should be straightforward to apply TSA by fine-tuning the speaker embedding extractor, but it requires some implementation work (basically adding a training loop inside inference.py).

I'll keep that in mind, but I cannot guarantee when I will implement a TSA function for this repo.
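
For reference, here is a very rough sketch of what such a loop could look like; `model`, `mel_spectrogram`, and `model.reconstruct` are placeholders for this repo's actual objects, not its real API:

```python
import torch
import torch.nn.functional as F

# Placeholder sketch of test-time speaker adaptation (TSA) inside inference.py:
# treat the test-time target utterance as training data and fine-tune for a few
# steps on an L1 Mel reconstruction loss. All names below are assumptions.
def test_time_adapt(model, target_wav, n_steps=200, lr=5e-5):
    target_mel = mel_spectrogram(target_wav)                      # ground-truth Mel of the target utterance
    params = [p for p in model.parameters() if p.requires_grad]   # optionally freeze most modules first
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_steps):
        pred_mel = model.reconstruct(target_wav)                  # hypothetical reconstruction forward pass
        loss = F.l1_loss(pred_mel, target_mel)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```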

vishalbhavani commented 1 year ago

Thanks for the confirmation. I can send a PR with TSA support. I am facing an issue: when I try to use the predicted Mels from the code in the inference function, they have two extra frames (1, 863, 80) compared to the Mels extracted with the mel_spectrogram function (1, 861, 80). What am I missing?

b04901014 commented 1 year ago

During training, you can pass the target Mel-spectrogram length to self.Ep and self.u2m (see lines 123 and 126 in trainer.py) to force the output shape to match your target Mel.

The slight mismatch arises because, at inference, I use a simple scale factor of 1.73 to estimate the expected Mel-spectrogram length, without accounting for padding/truncation.
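
For TSA, a minimal sketch of that length fix could look like this; the commented-out call is only a placeholder, so check trainer.py lines 123 and 126 for the real signatures:

```python
# Compute the output length from the ground-truth Mel instead of the 1.73 scale
# factor, then hand it to the decoding modules. Names below are assumptions.
tgt_mel = mel_spectrogram(target_wav)     # e.g. (1, 861, 80)
tgt_len = tgt_mel.size(1)                 # pass this as the expected output length

# pred_mel = self.u2m(..., tgt_len)       # instead of int(1.73 * n_units); see trainer.py lines 123/126
# assert pred_mel.size(1) == tgt_mel.size(1)   # no more (1, 863, 80) vs (1, 861, 80) mismatch
```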

vishalbhavani commented 1 year ago

Directly optimizing the L1 loss using the code in inference.py (with the Mel length fix) results in further deterioration. I can see that the forward pass differs between training and inference. Is there anything I should carry over from the training code?

b04901014 commented 1 year ago

It's hard to judge without the code, but I can think of a few pitfalls:

The only trainable modules should be self.RVEncoder.wav2vec2.spk_encoder and self.RVEncoder.linear_spk if we want to target the speaker encoder, as the original TSA does (a rough sketch follows after these points). I suspect training the whole system would make the result worse.

Also, you need to feed in the ground-truth energy and pitch instead of the predicted ones during training. You can see how to get those features in inference_exact_pitch.py. This may be the culprit.

We also use an adversarial loss during training. The checkpoint contains all the parameters needed to recover it.
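
A minimal sketch of the parameter selection from the first point; the attribute paths follow the names mentioned above and may not match the actual class layout exactly:

```python
import torch

# Freeze everything, then unfreeze only the speaker-encoder branch named above.
for p in model.parameters():
    p.requires_grad_(False)
for module in (model.RVEncoder.wav2vec2.spk_encoder, model.RVEncoder.linear_spk):
    for p in module.parameters():
        p.requires_grad_(True)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=5e-5)
```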

vishalbhavani commented 1 year ago

Actually, I kept the speaker encoding tensor trainable (initialized to tgt_attributes["a_s"]) and froze all parameters of Tester. Based on your suggestion, I used inference_exact_pitch, which improved the results. I trained on the L1 loss for now, but the result became noisy again. I presume the only missing thing is the adversarial loss?
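
Roughly, my setup looks like this (module and variable names are approximate, and `tester` is just my handle to the loaded Tester):

```python
import torch

# Optimize the speaker embedding itself, initialized from tgt_attributes["a_s"],
# with every Tester parameter frozen.
a_s = torch.nn.Parameter(tgt_attributes["a_s"].clone().detach())
for p in tester.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam([a_s], lr=1.0)   # this LR turns out to be too high, see below
# The forward pass then uses `a_s` in place of the extracted speaker embedding,
# and the L1 loss against the ground-truth Mel is backpropagated into `a_s`.
```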

b04901014 commented 1 year ago

I see. Hmm, I'm not sure that leaving out the adversarial loss would have that big an effect on performance. I am more wary about discarding the speaker encoder and replacing it with trainable parameters: now the model has no access to how the target speaker sounds from which to capture speaker information.

vishalbhavani commented 1 year ago
  1. You mean plugging in the adversarial loss, right? I haven't used it yet.
  2. The TSA proposed in NANSY does exactly this: keep the speaker representation trainable and backprop the L1 loss. The loss against the ground-truth Mels is the signal that tells the model how the target speaker sounds. Do you have an alternate approach in mind?

b04901014 commented 1 year ago

  1. I mean that it should still work without the adversarial loss; the original TSA also uses only the L1 loss. Since it didn't work, there is probably something else going wrong.
  2. Not quite. They apply TSA to the linguistic feature sequence, not the speaker embedding, so their TSA is intended to fix wrong pronunciations on unseen languages rather than the discrepancy of unseen speakers. I am not sure whether the same approach can be applied to a single speaker-embedding vector. However, in my opinion it should not get worse or noisier either. Maybe you can post the code here and we can see if there are other problems.

vishalbhavani commented 1 year ago

I agree with both points. Sharing the code (code.zip); let me know if you want to look at the audio samples as well. inference_exact_pitch.py contains the exact original code, just reformatted, and inference_exact_pitch_2.py contains the TSA modifications to that file.

b04901014 commented 1 year ago

I think the learning rate is too high. Can you try 5e-5? Also, did the loss actually decrease? And yes, it would be great to have some samples.

vishalbhavani commented 1 year ago

It is indeed high; I deliberately increased it during initial experiments and forgot to revert it. Yes, the loss is going down. I tried 5e-5, but the improvement felt slow. I also tried 1e-3 and 1e-2; the latter gave the best results, and the audio is also cleaner compared to the lr=1 case. Attaching the loss values and samples (samples_lr1e-2.zip) for lr=1e-2. I'll do more experiments to figure out the best hyperparameters.

  1. In my experience, an L1 loss below 0.35 implies good reconstruction of the target samples, but the best I got was 0.42 after 10k epochs (and the audio had noise). Low L1 loss with high noise is usually fixed by adding adversarial losses. Is it worth trying here?
  2. Do you think increasing the target audio duration would help? (I am currently using a 3-second sample.) Technically, a longer duration gives better gradients, though in my experience speaker identity extraction works well on shorter clips. A longer duration might also help extract a better representation for conversion by avoiding overfitting.

```
step  L1 loss
0     0.8427425622940063
50    0.6483885049819946
100   0.5941509008407593
150   0.5679752230644226
200   0.5498141050338745
250   0.5359079837799072
300   0.5285325050354004
350   0.5234791040420532
400   0.5187109708786011
450   0.5140001177787781
500   0.5097030401229858
550   0.5050584673881531
600   0.5010486245155334
650   0.4975414276123047
700   0.4940493702888489
750   0.49026718735694885
800   0.48776012659072876
850   0.48567694425582886
900   0.4837939739227295
950   0.4816649556159973
```

vishalbhavani commented 1 year ago

Also, why do we need https://github.com/b04901014/UUVC/blob/master/inference_exact_pitch.py#L162? I had to comment it out to make the shapes match.

b04901014 commented 1 year ago

> Also, why do we need https://github.com/b04901014/UUVC/blob/master/inference_exact_pitch.py#L162? I had to comment it out to make the shapes match.

You are right, it should be redundant in this context: for speaker conversion we are not changing the duration, so we can use the source melspec length as the output melspec length and don't need to estimate it with the scaling factor.

The original code on its own should also work fine. I think you get a shape error because you directly pass the source melspec length to the model, as I suggested earlier for TSA.
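
In other words, something like this hedged one-liner (variable names assumed):

```python
out_len = src_mel.size(1)   # reuse the source Mel length instead of int(round(1.73 * n_source_units))
```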

b04901014 commented 1 year ago

> 1. In my experience, an L1 loss below 0.35 implies good reconstruction of the target samples, but the best I got was 0.42 after 10k epochs (and the audio had noise). [...] Is it worth trying here?
> 2. Do you think increasing the target audio duration would help? (I am currently using a 3-second sample.) [...]

  1. Lower L1 loss but higher noise may also mean overfitting to that specific example. Since we now have only one sample, adversarial optimization will be rough, IMO, and that may also be why NANSY didn't apply an adversarial loss in their TSA.

  2. More (clean) data is always better (at least in this context) if we run TSA. If we do not run TSA, I think there will still be some improvement, but it may be quite marginal.

vishalbhavani commented 1 year ago

Also, the reconstructed audio has a different volume compared to the original audio. Any idea why that is?

b04901014 commented 1 year ago

> Also, the reconstructed audio has a different volume compared to the original audio. Any idea why that is?

I apply loudness normalization in inference.py; try commenting out https://github.com/b04901014/UUVC/blob/master/inference.py#L83

vishalbhavani commented 1 year ago

So my current code has the following changes:

  1. TSA to reconstruct the target audio
  2. Speaker identity as a trainable parameter, with the model completely frozen
  3. Use exact duration and exact energy
  4. Reduce the LR and train for longer (1e-2, 5000 epochs)
  5. Remove loudness normalization (for a better comparison)
  6. Train on L1 + adversarial loss
  7. Train on longer audio (5 min)

Now the identity match is better, but it has the following problems (new_samples.zip):

  1. The audio is noisier than before
  2. The content is not captured correctly from the source (verified by reconstructing the source audio)

Proposal: 1.

Do you have any more thoughts on getting this to work?

b04901014 commented 1 year ago

The identity really does seem to get better! But there is some weird noise that would normally be caught by the adversarial loss. What you are doing sounds good to me. One additional suggestion may be to randomly batch the 5-minute speech to add stochasticity to the gradient descent, which may help the adversarial training.

If you want to keep some layers trainable, maybe we can start by making all the parameters trainable and see whether that improves things.
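
A rough sketch of the random-batching idea; the window length, sample rate, and function names are arbitrary assumptions:

```python
import random
import torch

def random_chunk(wav: torch.Tensor, sr: int = 16000, seconds: float = 3.0) -> torch.Tensor:
    """Sample a random short window from the long target waveform each TSA step."""
    n = int(sr * seconds)
    if wav.size(-1) <= n:
        return wav
    start = random.randint(0, wav.size(-1) - n)
    return wav[..., start:start + n]

# Inside the TSA loop:
#   chunk = random_chunk(target_wav)
#   tgt_mel = mel_spectrogram(chunk)
#   ...compute the L1 (and adversarial) loss on this chunk instead of the full 5-min clip
```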

mvoodarla commented 1 year ago

This was an awesome thread to read through. Any chance you could share some of this code as a PR or a fork @vishalbhavani?