steven8274 opened this issue 3 months ago
TTS judges when the buffer holds enough text to start speaking. It is a known issue that text stream mode is not that fast for short sentences. We will improve this later.
Thank you for your response! Maybe the text buffer threshold that triggers TTS could be made configurable. This problem causes the first audio chunk to be returned very late when the large model's answer is relatively short, so streaming fails to achieve its purpose of shortening latency. If I cache the LLM response text chunks, break the cached text into sentences, and use non-streaming TTS to convert those sentences into audio, the overall latency ends up shorter.
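A minimal sketch of that sentence-level workaround might look like the following. The `synthesize_by_sentence` helper and the sentence-boundary regex are illustrative assumptions, not SDK API; only `speak_text_async` is a real SDK call.

```python
import re

import azure.cognitiveservices.speech as speechsdk

# split after CJK or ASCII sentence-ending punctuation (illustrative choice)
SENTENCE_END = re.compile(r'(?<=[。！？.!?])')

def synthesize_by_sentence(text_chunks, synthesizer: speechsdk.SpeechSynthesizer):
    """Buffer streamed LLM text deltas and synthesize one sentence at a time
    with the regular (non-streaming) TTS API."""
    buffer = ""
    for delta in text_chunks:
        buffer += delta
        pieces = SENTENCE_END.split(buffer)
        buffer = pieces.pop()  # keep the (possibly incomplete) tail buffered
        for sentence in pieces:
            if sentence.strip():
                # speak_text_async returns a future; .get() blocks until the
                # audio for this sentence is complete
                synthesizer.speak_text_async(sentence).get()
    if buffer.strip():
        synthesizer.speak_text_async(buffer).get()
```

Whether this beats text stream mode depends on the per-call overhead of each non-streaming request; a configurable buffer threshold would avoid that overhead entirely.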
This item has been open without activity for 19 days. Please provide a comment on status and remove the "update needed" label.
IN ORDER TO ASSIST YOU, PLEASE PROVIDE THE FOLLOWING:
Speech SDK log taken from a run that exhibits the reported issue: azure_speeck_sdk.zip
A stripped down, simplified version of your source code that exhibits the issue. Or, preferably, try to reproduce the problem with one of the public samples in this repository (or a minimally modified version of it), and share the code.
```python
import time
from datetime import datetime

import azure.cognitiveservices.speech as speechsdk
from openai import OpenAI


def speech_synthesizer_synthesizing_cb(evt: speechsdk.SessionEventArgs):
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    print(f'Synthesizing event[{now}]:')
    print('\tAudioData: {} bytes'.format(len(evt.result.audio_data)))


client = OpenAI()

# set up the speech synthesizer
# IMPORTANT: MUST use the websocket v2 endpoint
speech_config = speechsdk.SpeechConfig(
    endpoint="wss://eastasia.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
    subscription='MY_SPEECH_SDK_KEY')
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm)

# set a voice name
speech_config.speech_synthesis_voice_name = "zh-CN-XiaoxiaoMultilingualNeural"
# speech_config.speech_synthesis_voice_name = "zh-CN-XiaoxiaoNeural"

# raise the timeout values so the SDK does not cancel the request when GPT latency is high
properties = dict()
properties["SpeechSynthesis_FrameTimeoutInterval"] = "100000000"
properties["SpeechSynthesis_RtfTimeoutThreshold"] = "10"
speech_config.set_properties_by_name(properties)

speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "azure_speeck_sdk.log")

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
speech_synthesizer.synthesizing.connect(speech_synthesizer_synthesizing_cb)

# create a request with the TextStream input type
tts_request = speechsdk.SpeechSynthesisRequest(
    input_type=speechsdk.SpeechSynthesisRequestInputType.TextStream)
tts_task = speech_synthesizer.speak_async(tts_request)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a medical assistant, skilled in introducing medical concepts."},
        # "Briefly explain the concepts related to diabetes, in no more than 30 characters."
        {"role": "user", "content": "请简单说一下糖尿病的相关概念。不超过30字。"},
    ],
    stream=True,
)
for chunk in completion:
    print(chunk)
    # forward each streamed text delta to the TTS input stream
    if chunk.choices and chunk.choices[0].delta.content:
        tts_request.input_stream.write(chunk.choices[0].delta.content)

now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
print(f"[GPT END]{now}", end="\n")

time.sleep(30)
print("Sleep 30 seconds completed")

# close the tts input stream when GPT has finished
tts_request.input_stream.close()

# wait for all tts audio bytes to be returned
result = tts_task.get()
print("[TTS END]", end="\n")
```