Stream inference latency and voice quality are not good enough

steven8274 commented 3 months ago

Describe the bug On the 'Inference_streaming' branch, the first audio chunk is returned too late (over 2 seconds after the TTS text is sent to the inference interface 'inference_sft'). A 2-second latency is too high for real-time communication. I use CosyVoice in an 'stt + llm + tts' chain. The latencies of the stt and llm stages are acceptable; only the tts latency is not low enough. Besides, when I use CosyVoice streaming tts, the voice quality decreases: there are some overlapped voices. To achieve lower streaming tts latency, I changed these two configurations to half of their original values (in 'cosyvoice/cli/model.py'):

        self.token_min_hop_len = 50 #100
        self.token_max_hop_len = 200 #400

However, the latency did not decrease noticeably, while the voice quality got worse (the overlapped voices became more obvious).

To Reproduce Steps to reproduce the behavior:

Use 'inference_sft' in stream mode.

Expected behavior The first audio chunk is returned within a short time (under 300 ms or 500 ms is expected) and no overlapped voice can be heard.

Desktop (please complete the following information):

OS: Ubuntu
Version 20.04

aluminumbox commented 3 months ago

The latency is due to the Gradio audio cache. Flow matching is not a streaming model, so the overlapped voice is inevitable. We are also trying to solve it.

steven8274 commented 3 months ago

The Gradio audio component also introduces audio delay, but the delay I mean is not that one. I print the time before TTS starts and the time when the first audio chunk is generated. The difference is about 2 seconds. For streaming TTS used in real-time communication, a 2-second delay is not acceptable.
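
A minimal sketch of that kind of measurement, assuming the streaming TTS call returns a Python generator of audio chunks. `time_to_first_chunk` and `fake_stream` are hypothetical names used only for illustration; `fake_stream` just simulates a slow first chunk and is not CosyVoice code.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_chunk(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until first chunk, first chunk, iterator over the rest)."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)  # blocks until the stream yields its first chunk
    latency = time.perf_counter() - start
    return latency, first, it


def fake_stream(first_delay: float = 0.05, n_chunks: int = 3):
    """Hypothetical stand-in for a streaming TTS generator (assumption)."""
    time.sleep(first_delay)  # simulate the work done before the first chunk
    for i in range(n_chunks):
        yield f"chunk-{i}"


if __name__ == "__main__":
    latency, first, rest = time_to_first_chunk(fake_stream())
    print(f"first chunk {first!r} after {latency * 1000:.0f} ms")
    print("remaining:", list(rest))
```

Because a generator body only runs on the first `next()`, timing here excludes any UI or playback buffering. If the real inference call does work eagerly before returning the generator, the timer would need to start before that call instead.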

Zigars commented 3 months ago

Yes, I have the same problem. A 2 s latency does not work for my real-time communication.

huskyachao commented 3 months ago

Did anyone compare the streaming mode and the non-streaming mode? I found that the RTF (real-time factor = consuming_time / audio_len) of the streaming mode (1.5) is larger than that of the non-streaming mode (1.3). I wonder if this is the expected relationship between the streaming and non-streaming RTF.
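
For reference, the RTF formula above can be written as a small helper. This is only a sketch: `real_time_factor` is a hypothetical name, and the 22050 Hz sample rate in the example is an assumed value for illustration, not a figure from this thread.

```python
def real_time_factor(processing_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if sample_rate <= 0 or num_samples <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Assumed example: 3.0 s of compute for 2.0 s of audio at 22050 Hz.
    print(real_time_factor(3.0, 44100, 22050))
```

An RTF above 1 means synthesis is slower than playback, so a streaming consumer would eventually run out of audio regardless of how quickly the first chunk arrives.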

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.

FunAudioLLM / CosyVoice

Stream inference latency and voice quality are not good enough #294