FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capability.
https://funaudiollm.github.io/
Apache License 2.0

High latency before the first chunk returned in streaming mode #367

Open huskyachao opened 2 months ago

huskyachao commented 2 months ago

Hi, I found that after updating the code, the latency of the first chunk in streaming mode (inference_zero_shot) is still very high (around 3-4 s). I noticed that issue #294 also mentioned this problem before the update. Is this normal for the current version of CosyVoice? If it is, is there any way to reduce this latency? Such high latency makes real-time communication with CosyVoice very difficult.
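
For reference, this is roughly how I measure time-to-first-chunk with the streaming API. It is a minimal sketch: the model path, prompt wav, and transcripts are placeholders, and the exact `inference_zero_shot` signature (e.g. the `stream` argument) may differ between versions.

```python
import time

from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Placeholder model directory and prompt audio; adjust to your setup.
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)

start = time.time()
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Hello, this is a streaming latency test.',   # tts text (placeholder)
        'This is the prompt transcript.',             # prompt text (placeholder)
        prompt_speech_16k,
        stream=True)):
    if i == 0:
        # Time from the call until the first audio chunk is yielded.
        print(f'time to first chunk: {time.time() - start:.2f}s')
    # chunk['tts_speech'] holds the audio tensor for this chunk.
```

With this measurement the first print consistently lands around 3-4 s on my machine.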

aluminumbox commented 2 months ago

Yes, first-chunk inference is slower because there is no KV cache yet; text normalization (TN) also takes some time.

huskyachao commented 2 months ago

> Yes, first-chunk inference is slower because there is no KV cache yet; text normalization (TN) also takes some time.

OK, I see. Thanks for your answer.

MithrilMan commented 1 month ago

@aluminumbox I don't know the internal details, so this may be a stupid question, but in a use case where there is only one voice (e.g. a local personal assistant), could this be improved by preloading the KV cache? Or is that cache tied to the sentence being spoken, so it can't be preloaded and reused each time?