yangyyt opened this issue 3 months ago
In my experiments the loss doesn't change at all and is stuck around ~0.3, but I can observe the quality improving the longer I train. The longest I have trained so far is 400k iterations on two GPUs; I'm not sure what happens beyond that (I expect it to be better).
eval_audio is the modern one; I haven't updated the main eval. The vocoder is pretrained and is provided in the eval notebooks.
My training loss dropped from 2.x to 1.x, and I have trained for 1200+ steps. I only used data from LibriTTS; I don't know if that's normal.
It has dropped to about 0.3 today. I will test it to see the effect.
I have updated all the code in the eval notebook and also published how-to-use instructions.
Thanks a lot. I used the eval_audio.ipynb file to test my model and found the results were not as good as yours. I am going to check my model.
Style tokens (which are in fact just normalised pitch) improved emotional prosody a lot. Some of my notebooks have an example of inference without style tokens (I am training 10% of the time without them to make that work).
The generation is now normal. It seems there was something wrong with the input sample for the audio model: the log mel spectrogram needs to be normalized (std: 2.1615, mean: -5.8843). But why is this step added? The spectrogram was not normalized during model training. And how are the std and mean calculated? I have another question: how was your voice_x.pt generated?
It is normalized during training; those numbers are from the Voicebox paper. I feel that for my data they should be different, but I haven't been careful enough about that yet.
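The normalisation itself is a plain z-score on the log-mel values. A minimal sketch of applying it, inverting it before the vocoder, and recomputing the statistics for your own corpus (the function names are mine; the defaults are the Voicebox-paper numbers quoted above):

```python
import numpy as np

# Statistics from the Voicebox paper, as quoted above;
# recompute them for your own corpus if your data differs.
MEL_MEAN = -5.8843
MEL_STD = 2.1615

def normalize_mel(log_mel: np.ndarray,
                  mean: float = MEL_MEAN, std: float = MEL_STD) -> np.ndarray:
    """Z-score the log-mel spectrogram before feeding it to the model."""
    return (log_mel - mean) / std

def denormalize_mel(x: np.ndarray,
                    mean: float = MEL_MEAN, std: float = MEL_STD) -> np.ndarray:
    """Invert the normalisation before passing frames to the vocoder."""
    return x * std + mean

def mel_stats(log_mels: list[np.ndarray]) -> tuple[float, float]:
    """Mean and std over all frames of a corpus of log-mel spectrograms."""
    flat = np.concatenate([m.reshape(-1) for m in log_mels])
    return float(flat.mean()), float(flat.std())
```

Mismatched statistics between training and inference (normalising at one stage but not the other) produce exactly the kind of broken generation described earlier in the thread.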
voice_x.pt is generated using generate_voices.py from the root of the repo.
got it, thank you.
@ex3ndr have you thought of using the semantic model from WhisperSpeech?
@zvorinji Hey, I am not convinced that Whisper has anything useful. I tried in the past to use its latent outputs to predict the presence of voice, but it turned out that training from scratch was a much easier task. wav2vec would be a more reasonable alternative, but honestly, semantics-wise it is enough to have phonemes with pitch.
What is really missing is emotions and non-semantic information.
Regarding model training and inference, I have a few questions that I would like to ask.
Thank you very much for your reply.