Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Text cached without audio chunks returning when doing text-stream TTS #2542

Open steven8274 opened 1 month ago

steven8274 commented 1 month ago


```python
import time
from datetime import datetime

import azure.cognitiveservices.speech as speechsdk
from openai import OpenAI


def speech_synthesizer_synthesizing_cb(evt: speechsdk.SpeechSynthesisEventArgs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    print(f'Synthesizing event[{now}]:')
    print('\tAudioData: {} bytes'.format(len(evt.result.audio_data)))


client = OpenAI()

# Set up the speech synthesizer.
# IMPORTANT: MUST use the websocket v2 endpoint.
speech_config = speechsdk.SpeechConfig(
    endpoint="wss://eastasia.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
    subscription='MY_SPEECH_SDK_KEY')
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm)

# Set a voice name.
speech_config.speech_synthesis_voice_name = "zh-CN-XiaoxiaoMultilingualNeural"
# speech_config.speech_synthesis_voice_name = "zh-CN-XiaoxiaoNeural"

# Raise the timeout values so the SDK does not cancel the request
# when GPT latency is too high.
properties = dict()
properties["SpeechSynthesis_FrameTimeoutInterval"] = "100000000"
properties["SpeechSynthesis_RtfTimeoutThreshold"] = "10"
speech_config.set_properties_by_name(properties)

speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "azure_speech_sdk.log")

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
speech_synthesizer.synthesizing.connect(speech_synthesizer_synthesizing_cb)

# Create a request with the TextStream input type.
tts_request = speechsdk.SpeechSynthesisRequest(
    input_type=speechsdk.SpeechSynthesisRequestInputType.TextStream)
tts_task = speech_synthesizer.speak_async(tts_request)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a medical assistant, skilled in introducing medical concepts."},
        # "Briefly explain diabetes-related concepts, in no more than 30 characters."
        {"role": "user", "content": "请简单说一下糖尿病的相关概念。不超过30字。"},
    ],
    stream=True)

for chunk in completion:
    print(chunk)
    chunk_text = chunk.choices[0].delta.content
    if chunk_text:
        tts_request.input_stream.write(chunk_text)
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
        print(f"TTS text written: {chunk_text}, at {now}")

now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
print(f"[GPT END]{now}")

time.sleep(30)
print("Sleep 30 seconds completed")

# Close the TTS input stream when GPT has finished.
tts_request.input_stream.close()

# Wait for all TTS audio bytes to return.
result = tts_task.get()
print("[TTS END]")
```




**Describe the bug**
If the text stream is not long enough (for example, about 30 Chinese characters), the first few pushed text chunks never trigger the return of any audio chunks (at least not within 30 seconds) before `tts_request.input_stream.close()` is called. This makes the TTS streaming latency very large.

**To Reproduce**
Steps to reproduce the behavior:
1. Run the demo code above

**Expected behavior**
The first audio chunks are returned only a short time after the first text chunk is sent to the TTS service.

**Version of the Cognitive Services Speech SDK**
azure-cognitiveservices-speech 1.40.0

**Platform, Operating System, and Programming Language**
 - OS: Windows 10
 - Hardware: x64
 - Programming language: Python 3.8.19

niuzheng168 commented 1 month ago

The TTS service judges when the buffered text is long enough to speak. It is a known issue that on short sentences the text stream mode is not that fast. We will improve this later.
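
Conceptually, this behaves as if the service accumulated text until a size threshold is reached and only flushed the remainder when the input stream is closed, which is why a short answer produces no audio until `close()`. A purely illustrative client-side analogy; the threshold value and class are hypothetical, not the actual service logic:

```python
from typing import Optional

# Purely illustrative analogy of threshold-based text buffering; NOT the
# actual service implementation.
MIN_CHARS_TO_SPEAK = 50  # hypothetical threshold


class BufferingModel:
    def __init__(self):
        self.pending = ""

    def write(self, text: str) -> Optional[str]:
        """Accumulate text; emit a speakable span only once it is 'long enough'."""
        self.pending += text
        if len(self.pending) >= MIN_CHARS_TO_SPEAK:
            span, self.pending = self.pending, ""
            return span
        return None  # a short answer never crosses the threshold here

    def close(self) -> str:
        """Closing the stream flushes whatever remains, however short."""
        span, self.pending = self.pending, ""
        return span
```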

steven8274 commented 1 month ago

> The TTS service judges when the buffered text is long enough to speak. It is a known issue that on short sentences the text stream mode is not that fast. We will improve this later.

Thank you for your response! Maybe the text buffer threshold for starting TTS could be made configurable. This problem causes the first audio chunk to be returned very late when the large model's answer is relatively short, so streaming fails to achieve its purpose of shortening latency. If I cache the LLM response text chunks, break the cached text into sentences, and use non-streaming TTS to turn those sentences into audio, the overall latency is shorter; see the sketch below.
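
A minimal sketch of that sentence-buffering workaround, assuming the SDK's non-streaming `speak_text_async` API; the splitter regex and helper name are illustrative:

```python
import re

import azure.cognitiveservices.speech as speechsdk

# Illustrative workaround sketch: buffer LLM text deltas, cut on sentence-ending
# punctuation, and synthesize each complete sentence with non-streaming TTS so
# the first audio arrives as soon as the first sentence is complete.
SENTENCE_END = re.compile(r'([。！？.!?])')


def stream_llm_to_tts(chunks, synthesizer: speechsdk.SpeechSynthesizer):
    buffer = ""
    for chunk_text in chunks:  # chunks: iterable of LLM delta strings
        buffer += chunk_text
        parts = SENTENCE_END.split(buffer)  # [text, punct, text, punct, ..., tail]
        for i in range(0, len(parts) - 1, 2):
            sentence = (parts[i] + parts[i + 1]).strip()
            if sentence:
                synthesizer.speak_text_async(sentence).get()  # blocks per sentence
        buffer = parts[-1]  # keep the incomplete tail for the next delta
    if buffer.strip():
        synthesizer.speak_text_async(buffer).get()  # flush the final fragment
```

Serializing on `.get()` keeps sentences in order; a real implementation would more likely queue sentences and play audio as it streams back.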

github-actions[bot] commented 1 week ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.