Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Certain voice models emit incorrect word boundary events when processing special characters #2359

Open GJStevenson opened 2 months ago

GJStevenson commented 2 months ago

Describe the bug

A subset of the voice models appears to have difficulty processing the three special characters <, >, and &, even when they are written in entity format (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters). Once a special character appears in the script, the subsequent WordBoundary events report incorrect word boundaries.
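For reference, the entity form mentioned above can be produced with Python's standard library. This is only a minimal sketch: `to_ssml_text` is a hypothetical helper name and is not part of the Speech SDK or the linked sample.

```python
from xml.sax.saxutils import escape


def to_ssml_text(raw: str) -> str:
    """Escape the three XML special characters (&, <, >) for use inside SSML.

    Hypothetical helper for illustration; not part of the Speech SDK.
    """
    # escape() rewrites "&" as "&amp;", "<" as "&lt;" and ">" as "&gt;".
    return escape(raw)


print(to_ssml_text("Testing AT&T to see if it works"))
# prints: Testing AT&amp;T to see if it works
```

The escaped string is longer than the raw input, so any offsets reported against one form will not line up with the other.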

A non-exhaustive list of voice models that appear to exhibit this behavior:

  • en-US-AndrewNeural
  • en-US-BrianNeural
  • en-US-EmmaNeural
  • en-US-JennyMultilingualNeural
  • en-US-RyanMultilingualNeural

I've experienced this issue with the JavaScript SDK as well as the Python SDK. Sample code based on the Python sample project is here: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3

To Reproduce

  1. Pull down sample code in gist: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3
  2. Install dependencies listed in environment.yml
    • If using conda, run: conda env create -f environment.yml and then activate the environment.
  3. Set speech_key and service_region
  4. Choose voice model to use in speech_synthesis_word_boundary_event.
  5. Run python speech_synthesis_sample.py and enter your sample text.

NOTE: speak_text_async appears to convert the special characters to HTML entities automatically.

  6. View the results in the console and in the logs emitted to ./out/

    • For example, the sample text: Testing AT&T to see if it works will emit the word boundary events:
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.437500, text_offset=0, word_length=7), audio offset in ms: 50.0ms. Text: Testing
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=5000000, duration=0:00:00.962500, text_offset=-1, word_length=4), audio offset in ms: 500.0ms. Text: AT&a
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1475.0ms. Text: mp;
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=15750000, duration=0:00:00.200000, text_offset=11, word_length=4), audio offset in ms: 1575.0ms. Text: T to
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17875000, duration=0:00:00.112500, text_offset=16, word_length=2), audio offset in ms: 1787.5ms. Text: se
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=19125000, duration=0:00:00.087500, text_offset=18, word_length=3), audio offset in ms: 1912.5ms. Text: e i
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=20125000, duration=0:00:00.575000, text_offset=21, word_length=10), audio offset in ms: 2012.5ms. Text: f it works

After the &amp; is encountered, the word boundary events start reporting incorrect word boundaries (AT&a, mp;, T to, etc.). The same problem occurs with the other two special characters, < and >.
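The misaligned events above can be detected mechanically. The sketch below is a heuristic, not something from the SDK or the linked sample: it flags an event when its text_offset is negative, when its span contains whitespace, or when the span starts or ends in the middle of a word. The function name `suspicious_events` is hypothetical, and the (text_offset, word_length) pairs are copied from the log above.

```python
def suspicious_events(text, events):
    """Return indexes of events whose span does not look like one whole word.

    events: iterable of (text_offset, word_length) pairs.
    Heuristic only; hypothetical helper for illustration.
    """
    flagged = []
    for i, (offset, length) in enumerate(events):
        end = offset + length
        if offset < 0 or end > len(text):
            # Offset was not resolved, or the span runs past the input text.
            flagged.append(i)
            continue
        span = text[offset:end]
        # A span is cut mid-word when it and its neighbor are both alphanumeric.
        starts_mid_word = offset > 0 and text[offset - 1].isalnum() and span[:1].isalnum()
        ends_mid_word = end < len(text) and text[end].isalnum() and span[-1:].isalnum()
        if any(ch.isspace() for ch in span) or starts_mid_word or ends_mid_word:
            flagged.append(i)
    return flagged


text = "Testing AT&T to see if it works"
events = [(0, 7), (-1, 4), (-1, 3), (11, 4), (16, 2), (18, 3), (21, 10)]
print(suspicious_events(text, events))
# prints: [1, 2, 3, 4, 5, 6]
```

Only the first event ("Testing") passes; every event after the special character is flagged.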

Attached are some logs from running the input string Testing AT&T to see if it works against the voice models en-US-AndrewNeural and en-US-AriaNeural

Expected behavior

Word boundaries are reported correctly regardless of whether special characters are present.
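As a rough illustration of the expected result, the spans below are computed by splitting the raw input on whitespace. This is an assumption made for comparison purposes: the service may tokenize punctuation differently, and `expected_word_spans` is a hypothetical helper name.

```python
import re


def expected_word_spans(text):
    """Return (text_offset, word_length, word) for each whitespace-delimited token.

    Assumes simple whitespace tokenization; hypothetical helper for illustration.
    """
    return [(m.start(), len(m.group()), m.group()) for m in re.finditer(r"\S+", text)]


for offset, length, word in expected_word_spans("Testing AT&T to see if it works"):
    print(f"text_offset={offset}, word_length={length}, text={word}")
```

Under that assumption, "AT&T" would be reported as a single word at text_offset=8 with word_length=4, and the following words would start at offsets 13, 16, 20, 23, and 26.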

Version of the Cognitive Services Speech SDK

Python 1.37.0, JavaScript 1.31.0

Platform, Operating System, and Programming Language

Additional context

en-US-AndrewNeural Logs: speech_synthesis_en-US-AndrewNeural_20240430_163926.log

en-US-AriaNeural Logs: speech_synthesis_en-US-AriaNeural_20240430_165107.log

meetakshay99 commented 2 months ago

The same happens with Java too.

BeastBlood1885 commented 1 month ago

With Edge's Read Aloud, playback skips to the next sentence or passage every time it comes across those characters. This happens whether or not I'm using the multilingual versions of Andrew and Brian available there (which is a bit confusing, as the ones that don't say "multilingual" in their names still act as such). It happens with Remy too.

pankopon commented 1 month ago

@yulin-li Please check - a service side issue / voice model specific?

Kerry-LinZhang commented 1 month ago

@yanchang-gyc to follow up

GJStevenson commented 2 weeks ago

Not sure if this is the same issue, but the "en-US-SaraNeural" voice model will also report incorrect word boundary events after it encounters the letter "y".

en-US-SaraNeural Logs: speech_synthesis_en-US-SaraNeural_20240614_102504.log

Enter some text that you want to synthesize, Ctrl-Z to exit
Select Yes or No from the drop down menu
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.462500, text_offset=0, word_length=6), audio offset in ms: 50.0ms. Text: Select
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=5250000, duration=0:00:00.575000, text_offset=7, word_length=3), audio offset in ms: 525.0ms. Text: Yes
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=11000000, duration=0:00:00.150000, text_offset=11, word_length=2), audio offset in ms: 1100.0ms. Text: or
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=12625000, duration=0:00:00.200000, text_offset=14, word_length=2), audio offset in ms: 1262.5ms. Text: No
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.137500, text_offset=17, word_length=4), audio offset in ms: 1475.0ms. Text: from
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=16250000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1625.0ms. Text: rom
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17250000, duration=0:00:00.275000, text_offset=22, word_length=5), audio offset in ms: 1725.0ms. Text: the d
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=20125000, duration=0:00:00.225000, text_offset=27, word_length=5), audio offset in ms: 2012.5ms. Text: rop d
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=22500000, duration=0:00:00.450000, text_offset=32, word_length=8), audio offset in ms: 2250.0ms. Text: own menu
Speech synthesized for text [Select Yes or No from the drop down menu]
115246 bytes of audio data received.
Enter some text that you want to synthesize, Ctrl-Z to exit
yes is another word causing issues
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.600000, text_offset=0, word_length=3), audio offset in ms: 50.0ms. Text: yes
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=6500000, duration=0:00:00.125000, text_offset=4, word_length=2), audio offset in ms: 650.0ms. Text: is
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=7875000, duration=0:00:00.350000, text_offset=7, word_length=7), audio offset in ms: 787.5ms. Text: another
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=11500000, duration=0:00:00.187500, text_offset=-1, word_length=5), audio offset in ms: 1150.0ms. Text: her w
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=13500000, duration=0:00:00.387500, text_offset=16, word_length=8), audio offset in ms: 1350.0ms. Text: ord caus
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17500000, duration=0:00:00.587500, text_offset=24, word_length=10), audio offset in ms: 1750.0ms. Text: ing issues
Speech synthesized for text [yes is another word causing issues]
103726 bytes of audio data received.
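
Console output in the format above can be scanned for unresolved offsets with a short script. This is a sketch that assumes the exact line format printed by the sample; `EVENT_RE` and `parse_event` are hypothetical names, and the divisor of 10,000 is inferred from the log itself (audio_offset=500000 is printed as 50.0ms).

```python
import re

# Matches the event repr printed by the sample, followed by the "Text: ..." suffix.
EVENT_RE = re.compile(
    r"audio_offset=(?P<audio_offset>\d+), duration=(?P<duration>[\d:.]+), "
    r"text_offset=(?P<text_offset>-?\d+), word_length=(?P<word_length>\d+)\)"
    r".*?Text: (?P<text>.*)$"
)


def parse_event(line):
    """Parse one 'Word boundary event received' line; return None for other lines."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return {
        "audio_offset_ms": int(m["audio_offset"]) / 10_000,  # divisor inferred from the log
        "text_offset": int(m["text_offset"]),
        "word_length": int(m["word_length"]),
        "text": m["text"],
    }


lines = [
    "Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.137500, text_offset=17, word_length=4), audio offset in ms: 1475.0ms. Text: from",
    "Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=16250000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1625.0ms. Text: rom",
    "115246 bytes of audio data received.",
]
parsed = [e for e in map(parse_event, lines) if e is not None]
unresolved = [e for e in parsed if e["text_offset"] == -1]
print(unresolved)
```

In both runs above, the first event with text_offset=-1 ("rom", "her w") marks the point where the reported boundaries stop matching the input words.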