gemelo-ai / vocos

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
https://gemelo-ai.github.io/vocos/
MIT License

32kHz Vocos Multi Speaker Model Training Log #48

Open LEECHOONGHO opened 4 months ago

LEECHOONGHO commented 4 months ago

Training Loss, Generated Outputs.

I hope this will be a reference for model training.

https://api.wandb.ai/links/xi-speech-team/k0kdfwch

patriotyk commented 2 months ago

Do you have standard TensorBoard logs? It would be interesting to compare.

LEECHOONGHO commented 2 months ago

@patriotyk Sorry, I've changed the code to log to the WandB server. I have no local log files or TensorBoard logs.

patriotyk commented 2 months ago

What is your validation loss at the last checkpoint? It is encoded in the checkpoint file name. I have been training a 44100 Hz model for almost a week already and the loss is still going down.

Jon-Zbw commented 2 months ago

> Training Loss, Generated Outputs.
>
> I hope this will be a reference for model training.
>
> https://api.wandb.ai/links/xi-speech-team/k0kdfwch

Thanks for your work. Could you share details of your 32 kHz model training, e.g. your EnCodec model? (I found pretrained models at 24 kHz and 48 kHz, so I guess you resample 32 kHz audio to 24 kHz or 48 kHz for the pretrained EnCodec model, then resample the output back to 32 kHz?)

LEECHOONGHO commented 2 months ago

> Training Loss, Generated Outputs. I hope this will be a reference for model training. https://api.wandb.ai/links/xi-speech-team/k0kdfwch
>
> Thanks for your work. Could you share details of your 32 kHz model training, e.g. your EnCodec model? (I found pretrained models at 24 kHz and 48 kHz, so I guess you resample 32 kHz audio to 24 kHz or 48 kHz for EnCodec, then resample back to 32 kHz?)

Sorry for the confusion. I only trained a mel vocoder, not a decoder for EnCodec features.

But I plan to train a Mel-EnCodec (a mel-spectrogram-to-RVQ encoder, with a Vocos decoder, on varied speech data) in the future.
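For readers unfamiliar with the RVQ part of that plan: residual vector quantization encodes each feature frame with a stack of codebooks, where each stage quantizes what the previous stages missed. A minimal NumPy sketch (the codebook sizes, feature dimensions, and function names here are illustrative assumptions, not anything from this repo):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each codebook quantizes the residual left by earlier stages."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        # pick the nearest codeword for every frame
        idx = np.argmin(
            np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1), axis=1
        )
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 8))                       # 6 frames, 8-dim features
books = [rng.standard_normal((16, 8)) for _ in range(4)]   # 4 stages, 16 codes each
codes = rvq_encode(frames, books)
recon = rvq_decode(codes, books)
```

In a trained system the codebooks are learned (as in EnCodec) rather than random, and a decoder network, not a plain sum, maps the quantized features back to audio.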

LEECHOONGHO commented 2 months ago

> Do you have standard TensorBoard logs? It would be interesting to compare.
>
> What is your validation loss at the last checkpoint? It is encoded in the checkpoint file name. I have been training a 44100 Hz model for almost a week already and the loss is still going down.

I measured the mel loss and the generator loss on newly acquired data: 0.0942 and 2.82, respectively. Because of the dataset's size, the eval loss on the held-out set is no different from the loss on sampled training data.

How about your model's output quality? Any artifacts?
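For context on the mel-loss figure above: in Vocos-style training the mel loss is typically an L1 distance between log-compressed mel spectrograms of the reference and generated audio. A minimal NumPy sketch (the function name and clamping epsilon are assumptions, not the repo's exact implementation):

```python
import numpy as np

def log_mel_l1(mel_hat, mel, eps=1e-5):
    """L1 distance between log-compressed mel spectrograms."""
    log_hat = np.log(np.clip(mel_hat, eps, None))
    log_ref = np.log(np.clip(mel, eps, None))
    return float(np.mean(np.abs(log_hat - log_ref)))

rng = np.random.default_rng(0)
mel_ref = rng.uniform(0.0, 1.0, (100, 200))             # (n_mels, frames)
mel_gen = mel_ref * rng.uniform(0.8, 1.2, (100, 200))   # a slightly-off "generated" mel
loss = log_mel_l1(mel_gen, mel_ref)
```

A perfect reconstruction gives a loss of exactly zero, which is why the metric is comparable across checkpoints as long as the mel settings stay fixed.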

patriotyk commented 2 months ago

> Do you have standard TensorBoard logs? It would be interesting to compare.
>
> What is your validation loss at the last checkpoint? It is encoded in the checkpoint file name. I have been training a 44100 Hz model for almost a week already and the loss is still going down.
>
> I measured the mel loss and the generator loss on newly acquired data: 0.0942 and 2.82, respectively. Because of the dataset's size, the eval loss on the held-out set is no different from the loss on sampled training data.
>
> How about your model's output quality? Any artifacts?

I am still training (third week). It is very slow. I will post an update with my results when it finishes.

Mahmoud-ghareeb commented 2 months ago

How much data do we need for training?

patriotyk commented 1 month ago

@LEECHOONGHO I have published my model here: https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz. It sounds great, and metrics are included. @Mahmoud-ghareeb My model was trained on 800+ hours of audio. A vocoder doesn't require text transcripts, so you can easily use audiobooks for training. You don't even need to split them on silence, because Vocos internally splits the provided audio into smaller segments.
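The internal splitting mentioned above can be pictured as random fixed-length cropping of long recordings into training segments. A minimal NumPy sketch of the idea (the segment length and batch size here are arbitrary choices for illustration, not the repo's defaults):

```python
import numpy as np

def random_segments(audio, seg_len=16384, n=4, rng=None):
    """Crop n random fixed-length training segments from one long waveform."""
    rng = rng if rng is not None else np.random.default_rng(0)
    starts = rng.integers(0, len(audio) - seg_len + 1, size=n)
    return np.stack([audio[s:s + seg_len] for s in starts])

wav = np.random.default_rng(1).standard_normal(32000 * 10)  # 10 s of audio at 32 kHz
batch = random_segments(wav)                                # mini-batch of segments
```

Because segments are drawn on the fly, a few long audiobook files yield effectively unlimited distinct training examples.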

Mahmoud-ghareeb commented 1 month ago

Great work, @patriotyk! Thank you so much.

bzp83 commented 3 weeks ago

> @LEECHOONGHO I have published my model here: https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz. It sounds great, and metrics are included. @Mahmoud-ghareeb My model was trained on 800+ hours of audio. A vocoder doesn't require text transcripts, so you can easily use audiobooks for training. You don't even need to split them on silence, because Vocos internally splits the provided audio into smaller segments.

I'm new to this... Could you please tell me what the purpose of sharing the model is? When I try it on a wav file, the output is very close to the original input, so I'm confused.

Thank you

patriotyk commented 3 weeks ago

This model generates audio from mel spectrograms. The functionality you tried just generates a mel from the audio and then audio back from the mel (copy-synthesis). Real TTS systems generate mels directly from text, and then the vocoder generates the audio.

bzp83 commented 3 weeks ago

Ah, OK, so generating a mel from audio is different from what TTS systems do? Is there a code snippet that would let me test the model you trained (and possibly others)? Thank you!