FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
https://funaudiollm.github.io/
Apache License 2.0

About the training of the flow model #308

Open huskyachao opened 3 weeks ago

huskyachao commented 3 weeks ago

Hey, guys. I am trying to train the flow model from scratch recently.

But I am a bit confused about the training pipeline of the flow model. As suggested in #281 by @aluminumbox, the flow model can be trained by simply changing the `model` param in the `run.sh` script to 'flow', which initializes the whole flow model, e.g. `flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec` in `cosyvoice.yaml`. What confuses me is that the flow loss only accounts for the loss of `cosyvoice.flow.flow.MaskedDiffWithXvec.decoder`, with no terms from the other flow modules (e.g., the encoder and length_regulator).

Question:

  1. Is this setup suitable for flow training, given that the loss depends mostly on the decoder (`cosyvoice.flow.flow.MaskedDiffWithXvec.decoder`) of the flow model? Should we freeze part of the flow model (e.g., the encoder, length_regulator, and `nn.Embedding()`) when training it?
  2. If we initialize the whole flow model, how do the parameters of the other modules (everything except the decoder) get updated?
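On question 2, a minimal PyTorch sketch (toy modules, not the CosyVoice classes) showing that a loss computed only on the decoder's output still produces gradients for upstream modules, because their outputs are part of the decoder's computation graph:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the flow's sub-modules: an "encoder" feeding a
# "decoder"; the loss is computed only on the decoder output.
encoder = nn.Linear(8, 8)
decoder = nn.Linear(8, 8)

x = torch.randn(4, 8)
target = torch.randn(4, 8)

out = decoder(encoder(x))
loss = nn.functional.mse_loss(out, target)
loss.backward()

# Gradients reach the encoder too: its output is an input to the
# decoder, so backprop flows through it.
print(encoder.weight.grad is not None)  # True
print(decoder.weight.grad is not None)  # True
```

So as long as the encoder and length_regulator outputs feed the decoder and their parameters are in the optimizer, they are updated by the same loss.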

PS: the reason I ask is that I found training of the whole flow model was unstable when initializing all of it, whereas training was stable when I initialized only the decoder and froze the other modules with:

```python
model = configs['flow']
# Freeze everything, then unfreeze the decoder.
for param in model.parameters():
    param.requires_grad = False
for param in model.decoder.parameters():
    param.requires_grad = True
...
# Only the decoder's parameters are handed to the optimizer.
optimizer = optim.Adam(model.decoder.parameters(), **configs['train_conf']['optim_conf'])
```
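An equivalent, slightly safer pattern (a sketch with toy modules, not the CosyVoice config) is to build the optimizer from whatever currently has `requires_grad=True`, so the optimizer contents cannot drift apart from the freeze flags:

```python
import torch.nn as nn
import torch.optim as optim

# Toy model standing in for configs['flow'].
model = nn.ModuleDict({
    "encoder": nn.Linear(4, 4),
    "decoder": nn.Linear(4, 4),
})

# Freeze everything, then unfreeze the decoder.
for p in model.parameters():
    p.requires_grad = False
for p in model["decoder"].parameters():
    p.requires_grad = True

# Collect only the trainable parameters for the optimizer, so it
# always matches the requires_grad flags set above.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-4)

print(len(trainable))  # weight + bias of the decoder = 2 tensors
```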

Here are the loss curves without (the first one) and with (the second one) the freezing. In the first one, the loss crashes at around step 15k (the generated audio also degrades beyond 15k steps), while the second one is much more stable under the same settings, apart from the frozen modules. [loss-curve screenshots]

aluminumbox commented 3 weeks ago

Do not freeze the encoder; reduce the lr or increase the batch size.

JohnHerry commented 1 week ago

I compared the flow-model configs in `cosyvoice.yaml` and `cosyvoice.fromscratch.yaml`; the former is used to fine-tune the pretrained model and the latter to train from scratch on a small dataset. Comparing the small model (trained on the small dataset) with the big model (the open-source one configured by `cosyvoice.yaml`), the parameter reduction happens mainly in the Conformer encoder, which drops from 6 blocks to 3; on the decoder side, the ConditionalDecoder (which works as the ODE estimator) only loses 4 of its mid_blocks. Why does the ODE estimator keep such a heavy parameter budget? If I want to reduce the model's parameters further, where can I cut more?

JohnHerry commented 2 days ago

I don't fully understand some details of the flow model's length_regulator. My understanding: when speech tokens are extracted, S3 encodes at a 50 Hz token rate, with one token covering 320 samples; the mel spectrogram uses a hop size of 256, i.e. one mel frame per 256 samples, so at 16 kHz the frame rate is 62.5 Hz. The flow model therefore has to turn a 50 Hz sequence into a 62.5 Hz one, and the length_regulator interpolates the 50 Hz sequence up to the length implied by the 62.5 Hz rate. My impression is that the stack of convolutions inside the length_regulator just smooths the result of this hard interpolation. But in the length_regulator's inference path there is logic that splits into frequency bands and then recombines them. What is that part doing?
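The rate conversion described above can be sketched as follows (a minimal illustration of the arithmetic, not the actual length_regulator code; the tensor sizes are made up):

```python
import torch
import torch.nn.functional as F

TOKEN_HOP = 320  # one speech token covers 320 samples at 16 kHz -> 50 Hz
MEL_HOP = 256    # one mel frame covers 256 samples at 16 kHz -> 62.5 Hz

token_len = 100  # 2 s of speech tokens at 50 Hz
# Target length at the mel frame rate: 100 * 320 / 256 = 125 frames.
mel_len = token_len * TOKEN_HOP // MEL_HOP

# (batch, channels, time) sequence of token embeddings at 50 Hz.
tokens = torch.randn(1, 80, token_len)

# Hard linear interpolation to the mel frame rate; in the model, the
# length_regulator's conv stack then smooths this stretched sequence.
upsampled = F.interpolate(tokens, size=mel_len, mode="linear",
                          align_corners=False)

print(upsampled.shape)  # torch.Size([1, 80, 125])
```

Under these assumptions the stretch factor is simply 320 / 256 = 1.25, i.e. every 4 tokens map onto 5 mel frames.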