Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Python SDK text-to-speech WordBoundary event is sent at the beginning instead of at each new word #1826

Closed JeffreyLam2 closed 1 year ago

JeffreyLam2 commented 1 year ago

Describe the bug The documentation states that "This event is raised at the beginning of each new spoken word". However, all the word events fire at the beginning of synthesis.

document reference: [https://learn.microsoft.com/en-gb/azure/cognitive-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python]

To Reproduce Steps to reproduce the behavior:

  1. Define the code:

    import os
    import time
    import azure.cognitiveservices.speech as speechsdk

    def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
        print(f'Word Time: {time.time()} with word {evt.text}')

    print(f"start time: {time.time()}")
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "speechLog.log")
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    text = "You can replace this text with any text you wish. You can either write in this text box or paste your own text here.Try different languages and voices. Change the speed and the pitch of the voice. You can even tweak the SSML (Speech Synthesis Markup Language) to control how the different sections of the text sound. Click on SSML above to give it a try! Enjoy using Text to Speech!"
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

  2. Run the code
  3. Received unexpected result:

    start time: 1675793936.0649853
    Word Time: 1675793936.3229887 with word You
    Word Time: 1675793936.3239863 with word can
    ...
    Word Time: 1675793936.3882215 with word Enjoy
    Word Time: 1675793936.3902175 with word using
    Word Time: 1675793936.3912194 with word Text
    Word Time: 1675793936.392219 with word to
    Word Time: 1675793936.392219 with word Speech
    Word Time: 1675793936.3932173 with word !

The text takes around 30 seconds in total to speak, but all the WordBoundary events arrive within about 1 second.

Expected behavior The WordBoundary event should trigger at the start of each new word. In my case, it should take around 30 seconds to send all the events.

Version of the Cognitive Services Speech SDK azure-cognitiveservices-speech-1.25.0

Platform, Operating System, and Programming Language

Additional context

yulin-li commented 1 year ago

Hi @JeffreyLam2, thanks for using Azure Speech. Unfortunately, the word boundary event is designed to be triggered some time before the corresponding word is spoken, not exactly at the moment that word begins. The word boundary event has an audio_offset property that you can use to align it with the audio.
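For real-time subtitles, one possible approach is to treat each word boundary event as a cue rather than a trigger: store the word together with its audio_offset, then display it once playback has reached that offset. The sketch below assumes audio_offset is expressed in 100-nanosecond ticks, as the Speech SDK documentation describes, and that the application can measure how long the audio has been playing. `SubtitleScheduler`, `WordCue`, and `due_words` are hypothetical names for illustration, not part of the SDK, and the events are simulated with `SimpleNamespace`.

```python
from dataclasses import dataclass
from types import SimpleNamespace

TICKS_PER_SECOND = 10_000_000  # assumption: audio_offset uses 100 ns ticks


@dataclass
class WordCue:
    text: str
    offset_seconds: float


class SubtitleScheduler:
    """Hypothetical helper: collects word cues and reports which are due."""

    def __init__(self):
        self.cues = []

    def on_word_boundary(self, evt):
        # Intended to be connected with
        # speech_synthesizer.synthesis_word_boundary.connect(scheduler.on_word_boundary)
        self.cues.append(WordCue(evt.text, evt.audio_offset / TICKS_PER_SECOND))

    def due_words(self, elapsed_seconds):
        # Words whose audio position has been reached after
        # elapsed_seconds of playback.
        return [c.text for c in self.cues if c.offset_seconds <= elapsed_seconds]


# Simulated events standing in for the SDK's word boundary event arguments.
scheduler = SubtitleScheduler()
scheduler.on_word_boundary(SimpleNamespace(text="You", audio_offset=500_000))
scheduler.on_word_boundary(SimpleNamespace(text="can", audio_offset=2_500_000))
print(scheduler.due_words(0.1))
```

How the elapsed playback time is obtained depends on how the audio is played, so that part is left out of the sketch.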

yulin-li commented 1 year ago

@eric-urban do you think we need to improve the doc? It currently says: "This event is raised at the beginning of each new spoken word".

JeffreyLam2 commented 1 year ago

Hi @yulin-li, thanks a lot for the response. I am building real-time subtitles for the speech. I think the design of the word boundary event is a bit odd. If it fires at an uncertain time before the word, why not send all the information with the start event? Because the event fires at an uncertain time, it is hard to tell when the TTS actually speaks a given word, even with the audio_offset. The bookmark event behaves the same way.

Apart from that, I have a problem where sometimes no word boundary events are sent for the last few words of the final sentence. The completed event is also missing from time to time. I will try to collect more info and create a new ticket.

JeffreyLam2 commented 1 year ago

Found the reason for the missing word boundary events. It only happens when the same speech_synthesizer performs a speak call multiple times. I am not going to create a new ticket for it.

Solution: After every call, create a new speech_synthesizer. E.g., run this code for every new speak call:

    speech_config = speechsdk.SpeechConfig(subscription=settings.SPEECH_KEY, region=settings.SPEECH_REGION)
    speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "log/speechLog.log")

    # Speak via speaker
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(self._speech_synthesizer_word_boundary_cb)
    result = speech_synthesizer.speak_ssml_async(ssml_string).get()

pankopon commented 1 year ago

@yulin-li Is the above as intended or a bug? ("missing word boundary events" "when the same speech_synthesizer perform speak multiple times")

yulin-li commented 1 year ago

@JeffreyLam2 sorry for missing your latest comments. For the "missing word boundary events" issue, I think you need to wait for the synthesis_completed event to fire.
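One way to wait for that event without a polling loop is a `threading.Event` that the completed callback sets. This is only a sketch: `CompletionWaiter` is a hypothetical helper, not an SDK class, and the completion callback is simulated here with a timer instead of a real synthesizer.

```python
import threading


class CompletionWaiter:
    """Hypothetical helper: blocks until a completion callback has fired."""

    def __init__(self):
        self._done = threading.Event()

    def on_completed(self, evt):
        # Intended to be connected with
        # synthesizer.synthesis_completed.connect(waiter.on_completed)
        self._done.set()

    def wait(self, timeout=None):
        # Returns True if the callback fired before the timeout, else False.
        return self._done.wait(timeout)


waiter = CompletionWaiter()

# Simulate the SDK firing the completed event from another thread.
threading.Timer(0.05, waiter.on_completed, args=(None,)).start()

if waiter.wait(timeout=2.0):
    print("synthesis completed")
else:
    print("timed out waiting for completion")
```

Passing a timeout keeps the caller from blocking forever if the event never arrives.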

yulin-li commented 1 year ago

Sample code:

    import time

    finished = []

    def completed_callback(evt):
        # Record that the synthesis_completed event has fired.
        finished.append(True)

    synthesizer.synthesis_completed.connect(completed_callback)

    # Poll until the completed event has been received.
    while len(finished) < 1:
        time.sleep(0.02)

pankopon commented 1 year ago

Closed as resolved, please open a new issue if more support is needed.