FunAudioLLM / CosyVoice

A multilingual large voice generation model providing full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0

Why does the length of the final chunk change even when using the same input (streaming mode)? #433

Open huskyachao opened 2 days ago

huskyachao commented 2 days ago

Hi, I ran streaming generation several times with the same input, but the length of the final yielded chunk changes every time. As shown below, the yield speech len of the final chunk differs across the three generated samples: 0.8475 s, 2.3104 s, and 2.5890 s, respectively. Does anyone know why this happens?


100%|██████████| 1/1 [00:16<00:00, 16.68s/it]
100%|██████████| 1/1 [00:16<00:00, 16.68s/it]

2024-09-24 10:06:23,630 INFO synthesis text 我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力、
2024-09-24 10:06:23,630 INFO prompt speech len 3.0875
2024-09-24 10:06:25,734 INFO yield speech len 1.7647 | cost 2.1032 | rtf 1.1918 | init_delay 2.1025
2024-09-24 10:06:27,046 INFO yield speech len 1.9969 | cost 1.3117 | rtf 0.6568 | init_delay 2.1025
2024-09-24 10:06:28,371 INFO yield speech len 1.9969 | cost 1.3250 | rtf 0.6635 | init_delay 2.1025
2024-09-24 10:06:29,508 INFO yield speech len 1.9969 | cost 1.1365 | rtf 0.5692 | init_delay 2.1025
2024-09-24 10:06:30,058 INFO yield speech len 0.8475 | cost 0.5502 | rtf 0.6492 | init_delay 6.4269

100%|██████████| 1/1 [00:06<00:00,  6.51s/it]
100%|██████████| 1/1 [00:06<00:00,  6.51s/it]

2024-09-24 10:06:30,161 INFO synthesis text 我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力、
2024-09-24 10:06:30,161 INFO prompt speech len 3.0875
2024-09-24 10:06:32,260 INFO yield speech len 1.7647 | cost 2.0989 | rtf 1.1894 | init_delay 2.0981
2024-09-24 10:06:33,566 INFO yield speech len 1.9969 | cost 1.3063 | rtf 0.6542 | init_delay 2.0981
2024-09-24 10:06:34,890 INFO yield speech len 1.9969 | cost 1.3240 | rtf 0.6630 | init_delay 2.0981
2024-09-24 10:06:56,889 INFO yield speech len 2.3104 | cost 21.9980 | rtf 9.5214 | init_delay 26.7272

100%|██████████| 1/1 [00:26<00:00, 26.81s/it]
100%|██████████| 1/1 [00:26<00:00, 26.81s/it]

2024-09-24 10:06:56,995 INFO synthesis text 我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力、
2024-09-24 10:06:56,995 INFO prompt speech len 3.0875
2024-09-24 10:06:59,463 INFO yield speech len 1.7647 | cost 2.4683 | rtf 1.3987 | init_delay 2.4677
2024-09-24 10:07:00,469 INFO yield speech len 1.9969 | cost 1.0056 | rtf 0.5036 | init_delay 2.4677
2024-09-24 10:07:01,680 INFO yield speech len 1.9969 | cost 1.2103 | rtf 0.6061 | init_delay 2.4677
2024-09-24 10:07:13,595 INFO yield speech len 2.5890 | cost 11.9148 | rtf 4.6020 | init_delay 16.5990

aluminumbox commented 1 day ago

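There is randomness in the LLM sampling step, so the number of generated speech tokens, and with it the length of the final chunk, can differ between runs on the same input.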
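
To illustrate the point, here is a minimal toy sketch in plain Python. It is not CosyVoice code: the names EOS, VOCAB_SIZE, CHUNK_TOKENS, sample_tokens, and chunk_lengths are all hypothetical, and the fixed-size chunking is an assumption made only for illustration. It shows how a sampler that stops at a randomly drawn end-of-sequence token produces sequences of different lengths, so only the size of the last streamed chunk varies, and how a fixed seed makes the result repeatable.

```python
import random

EOS_PROB = 0.02    # hypothetical per-step probability of drawing end-of-sequence
VOCAB_SIZE = 50    # hypothetical vocabulary size for the toy model
CHUNK_TOKENS = 20  # hypothetical number of tokens emitted per streamed chunk


def sample_tokens(rng, eos_prob=EOS_PROB, max_len=500):
    """Sample tokens one at a time until EOS is drawn (toy stand-in for an LLM)."""
    tokens = []
    while len(tokens) < max_len:
        if rng.random() < eos_prob:
            break  # the sampler drew end-of-sequence, so generation stops here
        tokens.append(rng.randrange(1, VOCAB_SIZE))
    return tokens


def chunk_lengths(tokens, chunk_tokens=CHUNK_TOKENS):
    """Split tokens into fixed-size chunks; the last chunk holds the remainder."""
    return [len(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]


def run(seed=None):
    """Generate once and report the size of each streamed chunk."""
    rng = random.Random(seed)  # seed=None draws entropy from the OS
    return chunk_lengths(sample_tokens(rng))


if __name__ == "__main__":
    for _ in range(3):
        print("unseeded:", run())           # chunk counts and last chunk vary
    for _ in range(3):
        print("seed=1234:", run(seed=1234))  # identical on every run
```

In a PyTorch pipeline, the usual first thing to try for repeatable output is fixing the seed before each call (for example with torch.manual_seed). Whether that removes all run-to-run variation in CosyVoice is not confirmed in this thread.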