ex3ndr / supervoice

VoiceBox neural network implementation

Model convergence and inference #4

Open yangyyt opened 3 months ago

yangyyt commented 3 months ago

Regarding model training and inference, I have a few questions that I would like to ask.

  1. I have started training the model. Roughly what value should the loss converge to before the performance is good?
  2. For the inference code, should I refer to eval.ipynb or eval_audio.ipynb? In eval.ipynb I can't find model.tts, model.audio_model, or model.duration_model; in eval_audio.ipynb, do I need to train a vocoder model before I can test?

Thank you very much for your reply.

ex3ndr commented 3 months ago

In my experiments the loss barely changes and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).

eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

yangyyt commented 3 months ago

My training loss dropped from 2.x to 1.x after 1200+ steps. I only used data from LibriTTS; I don't know if that's normal.

yangyyt commented 3 months ago

> In my experiments the loss barely changes and gets stuck around ~0.3, but I can see the quality keep improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to get better).
>
> eval_audio is the up-to-date one; I haven't updated the main eval notebook. The vocoder is pretrained and is provided in the eval notebooks.

It has dropped to about 0.3 today. I will test it to see the effect.

ex3ndr commented 3 months ago

I have updated all the code in the eval notebook and also published how-to-use instructions.

yangyyt commented 3 months ago

> I have updated all the code in the eval notebook and also published how-to-use instructions.

Thanks a lot. I used eval_audio.ipynb to test my model and found that the results are not as good as yours, so I am going to check my setup:

  1. I only used LibriTTS data;
  2. I didn't use the style feature. Not sure how much impact these two have; I'm going to find out why.

ex3ndr commented 3 months ago

Style tokens (which are in fact just normalised pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them so that this works).
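
For reference, a rough sketch of what I mean by "normalised pitch as style tokens" plus occasional unconditional training; the bin count, the reserved null token, and the exact drop rate here are illustrative assumptions, not the repo's actual code:

```python
import torch

def make_style_tokens(f0, n_bins=256, drop_prob=0.1):
    # Hypothetical sketch: per-utterance pitch normalisation -> discrete tokens.
    # Token 0 is reserved for unvoiced/null frames (an assumption for illustration).
    voiced = f0 > 0
    norm = torch.zeros_like(f0)
    if voiced.sum() > 1:
        mean = f0[voiced].mean()
        std = f0[voiced].std().clamp(min=1e-5)
        norm[voiced] = (f0[voiced] - mean) / std

    # Map roughly +/-3 standard deviations onto n_bins tokens.
    tokens = ((norm.clamp(-3, 3) + 3) / 6 * (n_bins - 1)).long() + 1
    tokens[~voiced] = 0

    # Drop the whole style track ~10% of the time so the model also
    # learns to generate without style conditioning.
    if torch.rand(()) < drop_prob:
        tokens = torch.zeros_like(tokens)
    return tokens
```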

yangyyt commented 3 months ago

> Style tokens (which are in fact just normalised pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I train 10% of the time without them so that this works).

The generation is working normally now. It turned out there was something wrong with the input sample to the audio model: the log mel spec needs to be normalized (std: 2.1615, mean: -5.8843). But why is this step needed? The spectrum was not normalized during model training, as far as I can tell. And how are the std and mean calculated? One more question: how was your voice_x.pt generated?

ex3ndr commented 3 months ago

It is normalized during training; these numbers are from the VoiceBox paper, but I feel they should be different for my data. I just haven't been careful about that yet. voice_x.pt is generated using generate_voices.py from the root of the repo.
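
To illustrate (the helper names here are just a sketch, not the repo's actual code): the statistics are simply the global mean/std of the log-mel values over the training set, and the same affine transform is applied at training and inference time, then undone before the vocoder:

```python
def melspec_stats(specs):
    # Global mean/std over all log-mel frames in the training set.
    total, total_sq, count = 0.0, 0.0, 0
    for spec in specs:                  # spec: log-mel tensor [n_mels, frames]
        total += spec.sum().item()
        total_sq += (spec ** 2).sum().item()
        count += spec.numel()
    mean = total / count
    std = (total_sq / count - mean ** 2) ** 0.5
    return mean, std

def normalize(spec, mean=-5.8843, std=2.1615):
    return (spec - mean) / std          # applied both in training and inference

def denormalize(spec, mean=-5.8843, std=2.1615):
    return spec * std + mean            # undo before handing the mel to the vocoder
```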

yangyyt commented 3 months ago

Got it, thank you.

zvorinji commented 3 months ago

@ex3ndr have you thought of using the semantic model from WhisperSpeech?

ex3ndr commented 3 months ago

@zvorinji Hey, I am not convinced that Whisper has anything useful here. I tried in the past to use its latent outputs to predict the presence of voice, but it turned out that training from scratch was a much easier task. wav2vec would be a more reasonable alternative, but honestly, semantics-wise it is enough to have phonemes with pitch.

What is really missing is emotion and other non-semantic information.