FunAudioLLM / CosyVoice

Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0
4.18k stars 401 forks

about Cosy Voice #137

Open Ozawakun97 opened 1 month ago

Ozawakun97 commented 1 month ago

In the CosyVoice generation process, a flow matching module is employed to convert speech tokens into a Mel spectrogram. My question is: why is a flow matching model used instead of a diffusion model?

aluminumbox commented 1 month ago

Well, maybe @ZhihaoDU can answer this question. Have you done an ablation study?

ZhihaoDU commented 1 month ago

In general, flow matching models require far fewer iterations (5-10 steps) than denoising diffusion models to obtain a satisfactory Mel spectrogram. We chose flow matching over denoising diffusion models for efficiency reasons.
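To illustrate why few steps can suffice, here is a minimal toy sketch of fixed-step Euler sampling along a flow matching velocity field. It is not CosyVoice's implementation: `toy_velocity` is a hypothetical analytic stand-in for the learned network, the single 80-value `target` vector is a made-up placeholder for a Mel frame, and the straight-line path toward one target is an assumption chosen so the exact velocity is known.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    Assumption: a straight-line path from noise to a single target,
    for which the exact velocity field is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
target = [random.gauss(0.0, 1.0) for _ in range(80)]  # placeholder "Mel frame"
x0 = [random.gauss(0.0, 1.0) for _ in range(80)]      # Gaussian noise start

# Largest absolute deviation from the target for several step counts.
max_error = {}
for n_steps in (1, 5, 10):
    out = euler_sample(x0, target, n_steps)
    max_error[n_steps] = max(abs(a - b) for a, b in zip(out, target))
    print(n_steps, max_error[n_steps])
```

In this idealized case the trajectory is a straight line, so Euler integration reaches the target up to floating-point rounding regardless of the step count. A learned velocity field is only approximately straight, which is consistent with needing a handful of steps rather than one.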

Ozawakun97 commented 1 month ago

Thanks for answering