huskyachao opened 3 weeks ago
Do not freeze the encoder; reduce the learning rate or increase the batch size.
I compared the flow model configs in cosyvoice.yaml and cosyvoice.fromscratch.yaml; the former is used to fine-tune the pretrained model, and the latter to train from scratch on a small dataset. Compared with the big model (the open-source one configured in cosyvoice.yaml), the parameter reduction in the small model mainly occurs in the Conformer encoder, which shrinks from 6 blocks to 3, while on the decoder side the ConditionalDecoder (which serves as the ODE estimator) only drops 4 mid_blocks. Why does the ODE estimator keep such a heavy parameter budget? If I want to reduce the model parameters further, where can I cut more?
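To decide where to cut, it helps to count parameters per top-level submodule first. Below is a minimal sketch; the toy modules are placeholders standing in for the real CosyVoice classes, not the actual architecture:

```python
import torch.nn as nn

def param_counts(model: nn.Module) -> dict:
    """Count parameters of each top-level submodule."""
    return {name: sum(p.numel() for p in child.parameters())
            for name, child in model.named_children()}

# toy stand-in with a top-level layout loosely mirroring MaskedDiffWithXvec
flow = nn.Module()
flow.input_embedding = nn.Embedding(4096, 512)
flow.encoder = nn.Sequential(*[nn.Linear(512, 512) for _ in range(3)])
flow.length_regulator = nn.Conv1d(512, 512, kernel_size=3, padding=1)
flow.decoder = nn.Sequential(*[nn.Linear(512, 512) for _ in range(12)])

for name, n in param_counts(flow).items():
    print(f"{name:18s} {n:,}")
```

Running this on the real flow model would show directly whether the decoder (ODE estimator) dominates the budget and which submodule is the best candidate for trimming.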
I don't quite understand some details of the flow model's length_regulator. My understanding: when speech tokens are extracted, S3 encodes at a 50 Hz token rate, so one token covers 320 samples; the mel spectrogram uses a hop size of 256 samples per frame, so at 16 kHz the frame rate is 62.5 Hz. The flow model therefore has to convert a 50 Hz sequence into a 62.5 Hz one, and the length_regulator interpolates the 50 Hz sequence up to the length it would have at 62.5 Hz. My impression is that the stack of convolutions inside the length_regulator just smooths the result of that hard interpolation. But in the length_regulator's inference path there is logic that splits into frequency bands and then recombines them — what is that part doing?
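The rate conversion described above (50 Hz tokens to 62.5 Hz mel frames) can be sketched with a plain linear interpolation. The shapes and the `F.interpolate` call here are illustrative assumptions, not the exact length_regulator implementation:

```python
import torch
import torch.nn.functional as F

token_rate = 50.0   # S3 speech tokens: 16000 samples/s / 320 samples per token
mel_rate = 62.5     # mel frames:       16000 samples/s / 256 samples per hop

x = torch.randn(1, 512, 100)  # (batch, channels, 100 tokens = 2 s of audio)
target_len = int(x.shape[-1] * mel_rate / token_rate)  # 125 mel frames for the same 2 s
y = F.interpolate(x, size=target_len, mode="linear", align_corners=True)
print(y.shape)  # torch.Size([1, 512, 125])
```

In the real module this hard interpolation would then be followed by the convolution stack, which (as you guessed) smooths the interpolated sequence.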
Hey guys, I have recently been trying to train the flow model from scratch, but I am a bit confused about its training pipeline. As suggested in #281 by @aluminumbox, the flow can be trained by just changing the param `model` in the run.sh script to 'flow', which initializes the whole flow model, i.e., `flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec` in cosyvoice.yaml. What confuses me is that the loss function of the flow only takes the loss of `cosyvoice.flow.flow.MaskedDiffWithXvec.decoder` into consideration, without any contribution from the other modules of the flow (e.g., encoder, length_regulator).

Question: Is it reasonable to train only the decoder (`cosyvoice.flow.flow.MaskedDiffWithXvec.decoder`) of the flow model? Should we freeze part of the flow (e.g., the encoder, length_regulator, and nn.Embedding()) when training it?

PS: The reason I ask is that I found training of the whole flow model was unstable when I initialized the whole flow model, but training is stable when I initialize only the decoder and freeze the other modules.
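For reference, freezing everything except the decoder can be done by turning off gradients on the other top-level children. A minimal sketch, assuming attribute names that mirror the children of MaskedDiffWithXvec (the toy modules below are placeholders):

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, keep: str = "decoder") -> None:
    """Disable gradients for every top-level submodule except `keep`."""
    for name, child in model.named_children():
        if name != keep:
            for p in child.parameters():
                p.requires_grad = False

# toy stand-in: after freezing, only the decoder stays trainable
flow = nn.Module()
flow.encoder = nn.Linear(8, 8)
flow.length_regulator = nn.Linear(8, 8)
flow.decoder = nn.Linear(8, 8)
freeze_all_but(flow, keep="decoder")

trainable = [n for n, p in flow.named_parameters() if p.requires_grad]
print(trainable)  # ['decoder.weight', 'decoder.bias']
```

When building the optimizer, pass only `filter(lambda p: p.requires_grad, flow.parameters())` so the frozen weights are skipped.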
Here are the loss curves without (the first one) and with (the second one) freezing. From the first, it can be observed that there is a crash around step 15k (the generated audio also gets worse when steps > 15k), while the second is more stable with the same settings except for the frozen modules.