Closed JeffreyLam2 closed 1 year ago
Hi @JeffreyLam2, thanks for using Azure Speech. Unfortunately, the word boundary event is designed to be triggered before the corresponding word is spoken, not exactly at the beginning of that word. The word boundary event has an audio_offset property for alignment with the audio.
@eric-urban do you think we need to improve the doc? It currently says: "This event is raised at the beginning of each new spoken word".
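For readers who want to try the alignment suggested above, here is a minimal sketch of turning audio_offset into a display delay. It relies on audio_offset being expressed in ticks of 100 nanoseconds. The helper names and the `playback_start` value are assumptions for illustration: the application has to estimate for itself when audio playback actually begins, and that estimate limits how accurate the subtitles can be.

```python
import time

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def ticks_to_seconds(audio_offset_ticks):
    """Convert an audio_offset value in ticks to seconds from the start of the audio."""
    return audio_offset_ticks / TICKS_PER_SECOND


def seconds_until_word(audio_offset_ticks, playback_start, now=None):
    """Return how long to wait before showing a word (0.0 if it is already due).

    playback_start is a hypothetical, application-supplied time.monotonic()
    value taken when audio playback began.
    """
    if now is None:
        now = time.monotonic()
    due = playback_start + ticks_to_seconds(audio_offset_ticks)
    return max(0.0, due - now)
```

In a word boundary callback, one could store `(evt.audio_offset, evt.text)` in a queue and let a separate subtitle loop wait `seconds_until_word(...)` before displaying each word, rather than displaying the word when the event arrives.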
Hi @yulin-li, thanks a lot for the response. I am making real-time subtitles for the speech. I think the design of the word boundary event is a bit odd. If it fires at an uncertain time before the word, why not send all the information in the start event? Because the trigger time is uncertain, it is hard to tell when TTS speaks a given word, even with audio_offset. The bookmark event behaves the same way.
Apart from that, I have a problem where sometimes the last few words of the final sentence do not get word boundary events. The completed event is also missing from time to time. I will try to collect more info and create a new ticket.
Found the reason for the missing word boundary events. It only happens when the same speech_synthesizer performs speak multiple times. I am not going to create a new ticket for it.
Solution: create a new speech_synthesizer for every call. E.g., run this code each time you want a new speak:

    speech_config = speechsdk.SpeechConfig(subscription=settings.SPEECH_KEY, region=settings.SPEECH_REGION)
    speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "log/speechLog.log")
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(self._speech_synthesizer_word_boundary_cb)
    result = speech_synthesizer.speak_ssml_async(ssml_string).get()

@yulin-li Is the above as intended or a bug? ("missing word boundary events" "when the same speech_synthesizer perform speak multiple times")
@JeffreyLam2 sorry for missing your latest comments. For the "missing word boundary events" issue, I think you need to wait for the synthesis_completed event to fire.
Sample code:

    import time

    finished = []

    def completed_callback(evt):
        # Record that the synthesis_completed event has fired
        finished.append(True)

    synthesizer.synthesis_completed.connect(completed_callback)
    # Poll until the completed callback has run at least once
    while len(finished) < 1:
        time.sleep(0.02)

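As an alternative to the polling loop above, the wait can be sketched with `threading.Event`, which blocks without busy-waiting and supports a timeout. This is only an illustration, not the SDK's recommended pattern; `make_completion_waiter` is a hypothetical helper name, and the commented usage assumes the `synthesizer` and `ssml_string` objects from the earlier comments.

```python
import threading


def make_completion_waiter():
    """Return (callback, wait): wait() blocks until callback has been called."""
    done = threading.Event()

    def on_completed(evt):
        # Signal any waiter that the completed event has fired
        done.set()

    def wait(timeout=None):
        # Returns True if the event fired, False if the timeout expired first
        return done.wait(timeout)

    return on_completed, wait


# Intended usage with the SDK (not executed here):
# on_completed, wait = make_completion_waiter()
# synthesizer.synthesis_completed.connect(on_completed)
# synthesizer.speak_ssml_async(ssml_string)
# wait(timeout=60)
```

A timeout is worth passing in practice, since the thread reports that the completed event itself sometimes went missing.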
Closed as resolved, please open a new issue if more support is needed.
Describe the bug
The documentation states that "This event is raised at the beginning of each new spoken word". However, it fires all the word events at the beginning.
document reference: [https://learn.microsoft.com/en-gb/azure/cognitive-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python]
To Reproduce
Steps to reproduce the behavior:

    import os
    import time

    import azure.cognitiveservices.speech as speechsdk

    def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
        print(f'Word Time: {time.time()} with word {evt.text}')

    print(f"start time: {time.time()}")
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "speechLog.log")
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    text = "You can replace this text with any text you wish. You can either write in this text box or paste your own text here.Try different languages and voices. Change the speed and the pitch of the voice. You can even tweak the SSML (Speech Synthesis Markup Language) to control how the different sections of the text sound. Click on SSML above to give it a try! Enjoy using Text to Speech!"
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

Output:

    Word Time: 1675793936.3902175 with word using
    Word Time: 1675793936.3912194 with word Text
    Word Time: 1675793936.392219 with word to
    Word Time: 1675793936.392219 with word Speech
    Word Time: 1675793936.3932173 with word !

The speech takes around 30 seconds in total to speak, but all the WordBoundary events arrive within about 1 second.
Expected behavior
The WordBoundary event should trigger at the start of each new word. It should take around 30 seconds to send all the events in my case.
Version of the Cognitive Services Speech SDK
azure-cognitiveservices-speech-1.25.0
Platform, Operating System, and Programming Language
Additional context