GJStevenson opened 2 months ago
Same happens for Java too
With Edge's Read Aloud, whether or not I'm using the multilingual versions of Andrew and Brian available there (which is a bit confusing as the ones that don't say "multilingual" in their names still act as such), it skips to the next sentence/passage every time it comes across those characters. Happens with Remy too.
@yulin-li Please check: is this a service-side issue, or specific to certain voice models?
@yanchang-gyc to follow up
Not sure if this is the same issue, but the "en-US-SaraNeural" voice model will also report incorrect word boundary events after it encounters the letter "y".
Attached log: speech_synthesis_en-US-SaraNeural_20240614_102504.log
Enter some text that you want to synthesize, Ctrl-Z to exit
Select Yes or No from the drop down menu
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.462500, text_offset=0, word_length=6), audio offset in ms: 50.0ms. Text: Select
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=5250000, duration=0:00:00.575000, text_offset=7, word_length=3), audio offset in ms: 525.0ms. Text: Yes
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=11000000, duration=0:00:00.150000, text_offset=11, word_length=2), audio offset in ms: 1100.0ms. Text: or
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=12625000, duration=0:00:00.200000, text_offset=14, word_length=2), audio offset in ms: 1262.5ms. Text: No
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.137500, text_offset=17, word_length=4), audio offset in ms: 1475.0ms. Text: from
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=16250000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1625.0ms. Text: rom
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17250000, duration=0:00:00.275000, text_offset=22, word_length=5), audio offset in ms: 1725.0ms. Text: the d
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=20125000, duration=0:00:00.225000, text_offset=27, word_length=5), audio offset in ms: 2012.5ms. Text: rop d
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=22500000, duration=0:00:00.450000, text_offset=32, word_length=8), audio offset in ms: 2250.0ms. Text: own menu
Speech synthesized for text [Select Yes or No from the drop down menu]
115246 bytes of audio data received.
Enter some text that you want to synthesize, Ctrl-Z to exit
yes is another word causing issues
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.600000, text_offset=0, word_length=3), audio offset in ms: 50.0ms. Text: yes
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=6500000, duration=0:00:00.125000, text_offset=4, word_length=2), audio offset in ms: 650.0ms. Text: is
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=7875000, duration=0:00:00.350000, text_offset=7, word_length=7), audio offset in ms: 787.5ms. Text: another
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=11500000, duration=0:00:00.187500, text_offset=-1, word_length=5), audio offset in ms: 1150.0ms. Text: her w
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=13500000, duration=0:00:00.387500, text_offset=16, word_length=8), audio offset in ms: 1350.0ms. Text: ord caus
Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17500000, duration=0:00:00.587500, text_offset=24, word_length=10), audio offset in ms: 1750.0ms. Text: ing issues
Speech synthesized for text [yes is another word causing issues]
103726 bytes of audio data received.
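A correct event should satisfy two properties: slicing the input at `text_offset` for `word_length` characters reproduces `text`, and the slice starts and ends on real word boundaries. A standalone checker (a hypothetical helper, not part of the SDK), applied to values from the first log above, shows the later events fail one or both checks:

```python
def is_valid_boundary(source: str, text_offset: int, word_length: int, text: str) -> bool:
    """Check that a reported word boundary actually locates `text` in `source`."""
    if text_offset < 0:
        return False  # the buggy events sometimes report text_offset=-1
    end = text_offset + word_length
    if source[text_offset:end] != text:
        return False
    # The reported span must also start and end at real word boundaries.
    starts_ok = text_offset == 0 or not source[text_offset - 1].isalnum()
    ends_ok = end >= len(source) or not source[end].isalnum()
    return starts_ok and ends_ok

source = "Select Yes or No from the drop down menu"

# Events up to "from" line up with the input...
assert is_valid_boundary(source, 0, 6, "Select")
assert is_valid_boundary(source, 17, 4, "from")

# ...but the later ones do not: "rom" has text_offset=-1, and "the d"
# matches a slice of the source yet cuts the word "drop" in the middle.
assert not is_valid_boundary(source, -1, 3, "rom")
assert not is_valid_boundary(source, 22, 5, "the d")
```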
Describe the bug
A subset of the voice models appear to have difficulty processing the three special characters `<`, `>`, and `&`, even when using entity format (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters). After a special character is present in the script, the WordBoundary events will begin to report incorrect word boundaries.

A non-exhaustive list of voice models that appear to exhibit this behavior:

en-US-AndrewNeural
en-US-BrianNeural
en-US-EmmaNeural
en-US-JennyMultilingualNeural
en-US-RyanMultilingualNeural
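For reference, the entity format the linked docs describe maps `&` to `&amp;`, `<` to `&lt;`, and `>` to `&gt;`; in Python that escaping can be produced with the standard library (shown only to make the escaped input explicit, since per this report the bug occurs even with correctly escaped input):

```python
from xml.sax.saxutils import escape

# escape() rewrites the three SSML-special characters as XML entities.
assert escape("Testing AT&T to see if it works") == "Testing AT&amp;T to see if it works"
assert escape("a < b > c") == "a &lt; b &gt; c"
```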
I've experienced this issue with the JavaScript SDK as well as the Python SDK. Sample code using the Python sample project: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3
To Reproduce
1. Run `conda env create -f environment.yml` and then activate the environment.
2. Set `speech_key` and `service_region` in `speech_synthesis_word_boundary_event`.
3. Run `python speech_synthesis_sample.py` and enter your sample text. NOTE: `speak_text_async` appears to handle converting the special characters to HTML entities automatically.
4. View the results in the console, and where the logs are emitted (`./out/`).

Entering `Testing AT&T to see if it works` will emit the word boundary events. After the `&` is encountered, the word boundary events start reporting incorrect word boundaries (`AT&a`, `mp;`, `T to`, etc.). The same issue also exists with the other two special characters, `<` and `>`.

Attached are some logs from running the input string `Testing AT&T to see if it works` against the voice models `en-US-AndrewNeural` and `en-US-AriaNeural`.
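For context when reading the logs above: `audio_offset` is reported in 100-nanosecond ticks, which the sample divides by 10,000 to get milliseconds. A minimal stand-in (the dataclass below is hypothetical and simply mirrors the attribute names visible in the logged event reprs; the real events come from the SDK's word boundary callback) shows the conversion:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class WordBoundaryEvent:
    """Stand-in mirroring the attributes shown in the logged event reprs."""
    audio_offset: int       # in 100-nanosecond ticks
    duration: timedelta
    text_offset: int
    word_length: int
    text: str

def format_event(evt: WordBoundaryEvent) -> str:
    # 10,000 ticks of 100 ns each make one millisecond.
    return f"audio offset in ms: {evt.audio_offset / 10_000}ms. Text: {evt.text}"

# With the real SDK this would be connected as a callback, roughly:
#   synthesizer.synthesis_word_boundary.connect(lambda e: print(format_event(e)))
print(format_event(WordBoundaryEvent(500000, timedelta(microseconds=462500), 0, 6, "Select")))
```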
Expected behavior
Word boundaries are reported correctly regardless of whether special characters are present in the input.
Version of the Cognitive Services Speech SDK
Python: 1.37.0; JavaScript: 1.31.0
Platform, Operating System, and Programming Language
Additional context
en-US-AndrewNeural Logs: speech_synthesis_en-US-AndrewNeural_20240430_163926.log
en-US-AriaNeural Logs: speech_synthesis_en-US-AriaNeural_20240430_165107.log