Text to Speech SSML, speed, volume, related features. Lack of documentation, loss of line breaks, possible aberrant prosody, nuances.

mirage335 commented 5 years ago

Joystick Gremlin documentation on the Text to Speech feature does not mention support for Speech Synthesis Markup Language (SSML), although SSML is supported and appears to be the only way to change speed, volume, or other characteristics. https://whitemagic.github.io/JoystickGremlin/interface/ https://www.google.com/search?q="Joystick+Gremlin"+"Speech+Synthesis+Markup+Language"

_

The Joystick Gremlin save/load profile feature does not preserve line breaks in Text to Speech blocks. Whitespace cannot be used for pauses, and SSML is crushed into a barely readable one-line string.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
<prosody volume="medium" rate="medium">same line test <break/> different line
test
same line test</prosody>
</speak>

... becomes ...

<?xml version="1.0"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"          xsi:schemaLocation="http://www.w3.org/2001/10/synthesis                    http://www.w3.org/TR/speech-synthesis/synthesis.xsd"          xml:lang="en-US"> <prosody volume="medium" rate="medium">same line test <break/> different line test same line test</prosody> </speak>

With predictable consequences.

_

Handling of the prosody element's rate attribute differs from VoiceAttack, at least, and from some specifications.

Specifying percentage rates for prosody, instead of "x-slow", "medium", etc, may be intermittently unreliable.
Rates from 1%-100% are apparently meaningful, instead of 20%-300%.
An equivalent percentage rate to "x-slow", perhaps "1%" or "20%", is not available.

Relevant specifications are available from W3 and Amazon. https://www.w3.org/TR/speech-synthesis11/#S3.2.4 https://developer.amazon.com/docs/custom-skills/speech-synthesis-markup-language-ssml-reference.html#prosody

A bit of SSML can be used to repeatedly test this behavior.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
You have selected Microsoft Zira as the default voice.
<prosody rate="x-slow">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="slow">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="medium">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="fast">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="x-fast">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="1%">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="20%">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="100%">You have selected Microsoft Zira as the default voice.</prosody>
<prosody rate="300%">You have selected Microsoft Zira as the default voice.</prosody>
</speak>
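
The test document above can also be generated instead of edited by hand, which makes it easier to try other rate values. The following is a minimal Python sketch and not part of Joystick Gremlin: the function name `rate_test_ssml` is made up for illustration, and the header is copied from the SSML above. It emits the document as a single line, since line breaks are lost on save anyway.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Header copied from the SSML used in this issue, joined into one line.
SSML_OPEN = (
    '<?xml version="1.0"?>'
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.w3.org/2001/10/synthesis '
    'http://www.w3.org/TR/speech-synthesis/synthesis.xsd" '
    'xml:lang="en-US">'
)


def rate_test_ssml(sentence, rates):
    """Return one-line SSML that speaks `sentence` plainly, then once per rate.

    Hypothetical helper for illustration only.
    """
    # Collapse any whitespace in the sentence and escape XML special characters.
    spoken = escape(" ".join(sentence.split()))
    parts = [SSML_OPEN, spoken]  # first utterance has no prosody element
    for rate in rates:
        parts.append(f"<prosody rate={quoteattr(rate)}>{spoken}</prosody>")
    parts.append("</speak>")
    document = " ".join(parts)
    ET.fromstring(document)  # raises ParseError if the result is not well-formed
    return document


if __name__ == "__main__":
    rates = ["x-slow", "slow", "medium", "fast", "x-fast",
             "1%", "20%", "100%", "300%"]
    print(rate_test_ssml(
        "You have selected Microsoft Zira as the default voice.", rates))
```

The printed line can be pasted into a Text to Speech action. The script only checks that the XML is well-formed; how each rate actually sounds still depends on the Windows speech engine.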

Documenting the reason for this behavior would probably be sufficient, as opposed to achieving some sort of standards compliance. Given the large number of Text to Speech commands users may configure with Joystick Gremlin, it is important to know their behavior will not change, although editing the XML with Find and Replace is a possible workaround in that scenario.

_

A couple more nuances could be included in the documentation.

Speech requests are executed serially. Long sentences will play back one after another, which can build up to endless speech.
Audio output from Text to Speech requests is through the Joystick Gremlin entry in the MSW system volume mixer. Consequently, it is possible to mute these with a VoiceAttack command adjusting the audio level for Joystick Gremlin specifically.

WhiteMagic commented 5 years ago

Gremlin doesn't do anything related to text-to-speech itself; it simply hands the text off to Windows to do whatever it wants with it. I never looked into any of these things as I don't really use it that much myself. From memory, there are also other issues, such as certain voices not working.

The issue with the XML snippets getting mangled up is simply due to those elements being considered plain text, as opposed to an XML document. The only way to really solve this would be to have Gremlin understand this and include the content entered as an actual XML document. However, that runs a major risk of the entire profile becoming unusable if the user makes an error when entering the XML document, as they aren't really meant to be human writable.
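
The observed collapse, where each line break becomes a single space and the indentation survives, matches what any XML parser does to attribute values. Whether Gremlin actually stores the text in an attribute is an assumption here, not something confirmed in this thread. The sketch below uses a hypothetical `action` element with a hypothetical `text` attribute to show the effect, and shows that writing line breaks as `&#10;` character references would survive a round trip.

```python
import xml.etree.ElementTree as ET


def escape_attr(value, keep_newlines=False):
    """Escape text for use inside a double-quoted XML attribute."""
    value = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    if keep_newlines:
        # Character references are not subject to attribute-value normalization.
        value = value.replace("\n", "&#10;")
    return value


def round_trip(text, keep_newlines=False):
    """Store `text` in an attribute, parse it back, and return what survives.

    The element name "action" and attribute name "text" are hypothetical;
    they are not taken from the real profile format.
    """
    stored = f'<action text="{escape_attr(text, keep_newlines)}"/>'
    return ET.fromstring(stored).get("text")


if __name__ == "__main__":
    multi_line = "same line test <break/> different line\ntest\nsame line test"
    print(repr(round_trip(multi_line)))                      # newlines become spaces
    print(repr(round_trip(multi_line, keep_newlines=True)))  # newlines preserved
```

If this is indeed the mechanism, line breaks could be preserved without treating the entered text as an XML document, which would avoid the risk of a malformed entry breaking the profile.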

There are clearly a ton of improvements that could be made to the TTS action, though, and some of these things look interesting. However, at this stage I can't tell when I will get to it, especially seeing as I haven't fixed bugs with other voices that I have known about for about two years or so.

mirage335 commented 5 years ago

Perhaps in the short term, the documentation could be updated to point out Text to Speech supports SSML? This snippet is probably a good example for documentation sake, as it works around most of the issues.

<?xml version="1.0"?><speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <prosody volume="medium" rate="fast">echo</prosody> </speak>
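
For anyone configuring many such actions, the one-line form above can be produced by a small script. This is a rough Python sketch under stated assumptions: `speak_line` is a made-up name, the header is copied from the snippet above, and pauses are written as explicit `<break/>` elements because whitespace does not survive as a pause.

```python
from xml.sax.saxutils import escape, quoteattr

# Header copied verbatim from the one-line snippet in this comment.
HEADER = (
    '<?xml version="1.0"?>'
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.w3.org/2001/10/synthesis '
    'http://www.w3.org/TR/speech-synthesis/synthesis.xsd" '
    'xml:lang="en-US">'
)


def speak_line(phrases, rate="fast", volume="medium"):
    """Build a one-line SSML string; phrases are separated by <break/>.

    Hypothetical helper for illustration only.
    """
    # Collapse whitespace inside each phrase and escape &, < and >.
    cleaned = [escape(" ".join(p.split())) for p in phrases]
    body = " <break/> ".join(cleaned)
    return (
        f"{HEADER} <prosody volume={quoteattr(volume)} "
        f"rate={quoteattr(rate)}>{body}</prosody> </speak>"
    )


if __name__ == "__main__":
    print(speak_line(["echo"]))
    print(speak_line(["gear down", "flaps full"], rate="medium"))
```

Called with `["echo"]` and the default arguments, it should reproduce the snippet above; longer confirmations can pass several phrases to get a pause between them.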

Text-to-Speech is an important feature. Many joystick buttons do not provide adequate tactile feedback. Quick voice confirmation is nice too, especially with a variety of complicated layouts.

WhiteMagic / JoystickGremlin

Text to Speech SSML, speed, volume, related features. Lack of documentation, loss of line breaks, possible aberrant prosody, nuances. #248