Describe the bug
Certain TTS voices do not provide speech marks with viseme timings. For example, all the Urdu Azure TTS voices (such as ur-PK-UzmaNeural) provide word timings but do not provide viseme timings, which is what we need.
To Reproduce
Use the following code:
import com.microsoft.cognitiveservices.speech.*;
import java.util.concurrent.ExecutionException;

// Speech synthesis with word boundary, viseme, and bookmark events.
public static void synthesisWordAndVisemeBoundaryEventAsync() throws InterruptedException, ExecutionException
{
    // Creates an instance of a speech config with the specified
    // subscription key and service region. Replace with your own subscription key
    // and service region (e.g., "westus").
    // The default language is "en-us"; it is overridden with Urdu below.
    SpeechConfig config = SpeechConfig.fromSubscription("subscription_key", "region");
    config.setSpeechSynthesisLanguage("ur-PK");
    config.setSpeechSynthesisVoiceName("ur-PK-UzmaNeural");

    // Creates a speech synthesizer with a null audio config.
    // This means the audio output data will not be written to any stream.
    // You can just get the audio from the result.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);

    String text = "<speak version=\"1.0\" xml:lang=\"en-US\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"https://www.w3.org/2001/mstts\"><voice name=\"ur-PK-UzmaNeural\"><prosody rate=\"1\">سنو ذرا</prosody></voice></speak>";

    // Subscribes to the word boundary event.
    synthesizer.WordBoundary.addEventListener((o, e) -> {
        try {
            // The text offset is relative to the full SSML string, not just the spoken text.
            String word = text.substring((int) e.getTextOffset(), (int) (e.getTextOffset() + e.getWordLength()));
            // The unit of the audio offset is the tick (1 tick = 100 nanoseconds).
            // Adding 5,000 before dividing by 10,000 rounds to the nearest millisecond.
            System.out.print("Word boundary event received. Audio offset: " + (e.getAudioOffset() + 5000) / 10000 + "ms, ");
            System.out.println("text offset: " + e.getTextOffset() + ", word length: " + e.getWordLength() + ", word = " + word + ".");
        } catch (Exception ex) {
            System.out.println("Exception = " + ex);
            ex.printStackTrace();
        }
    });

    // Subscribes to the viseme received event.
    synthesizer.VisemeReceived.addEventListener((o, e) -> {
        // The unit of the audio offset is the tick (1 tick = 100 nanoseconds); divide by 10,000 to convert to milliseconds.
        System.out.print("Viseme event received. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
        System.out.println("viseme id: " + e.getVisemeId() + ".");
    });

    // Subscribes to the bookmark reached event.
    synthesizer.BookmarkReached.addEventListener((o, e) -> {
        // The unit of the audio offset is the tick (1 tick = 100 nanoseconds); divide by 10,000 to convert to milliseconds.
        System.out.print("Bookmark reached. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
        System.out.println("bookmark text: " + e.getText() + ".");
    });

    SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(text).get();

    // Checks the result.
    if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
        System.out.println("Speech synthesized for text [" + text + "].");
        byte[] audioData = result.getAudioData();
        System.out.println(audioData.length + " bytes of audio data received for text [" + text + "]");
    }
    else if (result.getReason() == ResultReason.Canceled) {
        SpeechSynthesisCancellationDetails cancellation = SpeechSynthesisCancellationDetails.fromResult(result);
        System.out.println("CANCELED: Reason=" + cancellation.getReason());
        if (cancellation.getReason() == CancellationReason.Error) {
            System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
            System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
            System.out.println("CANCELED: Did you update the subscription info?");
        }
    }

    result.close();
    synthesizer.close();
}
It outputs the following:
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 50ms, Viseme event received. Audio offset: 50ms, viseme id: 0.
text offset: 175, word length: 3, word = سنو.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 463ms, text offset: 179, word length: 3, word = ذرا.
Speech synthesized for text [سنو ذرا].
55342 bytes of audio data received for text [سنو ذرا]
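For reference, the millisecond offsets in this output are derived from tick values (1 tick = 100 nanoseconds). The word boundary handler rounds to the nearest millisecond while the viseme handler truncates, so the two can differ by 1 ms for the same tick value. A minimal sketch of both conversions follows; the class name, method names, and the sample tick value are hypothetical and not part of the Speech SDK.

```java
public final class TickConversion {
    // 1 tick = 100 ns, so 10,000 ticks = 1 ms.
    private static final long TICKS_PER_MILLISECOND = 10_000L;

    // Rounds to the nearest millisecond (same arithmetic as the word boundary handler).
    public static long ticksToMillisRounded(long ticks) {
        return (ticks + TICKS_PER_MILLISECOND / 2) / TICKS_PER_MILLISECOND;
    }

    // Truncates toward zero (same arithmetic as the viseme and bookmark handlers).
    public static long ticksToMillisTruncated(long ticks) {
        return ticks / TICKS_PER_MILLISECOND;
    }

    public static void main(String[] args) {
        long ticks = 4_627_500L; // hypothetical tick value, chosen for illustration only
        System.out.println(ticksToMillisRounded(ticks));   // prints 463
        System.out.println(ticksToMillisTruncated(ticks)); // prints 462
    }
}
```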
Expected behavior
We should receive viseme events with timings for all the visemes that make up the spoken words, not just repeated events with viseme id 0 at a single audio offset.
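One way to check for this condition automatically is to collect the viseme events and test whether they all carry the same ID and the same offset, as in the output above. The sketch below is only an illustration under that assumption: `VisemeSanityCheck`, `VisemeSample`, and `looksDegenerate` are hypothetical names, and the samples would be filled in from `getAudioOffset()` and `getVisemeId()` inside the `VisemeReceived` listener.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class VisemeSanityCheck {
    // Hypothetical stand-in for the two fields read from each viseme event.
    public static final class VisemeSample {
        final long audioOffsetMs;
        final int visemeId;

        public VisemeSample(long audioOffsetMs, int visemeId) {
            this.audioOffsetMs = audioOffsetMs;
            this.visemeId = visemeId;
        }
    }

    // Returns true when no events arrived, or when every event has the same
    // viseme ID and the same audio offset (the pattern reported in this issue).
    public static boolean looksDegenerate(List<VisemeSample> samples) {
        Set<Integer> ids = new HashSet<>();
        Set<Long> offsets = new HashSet<>();
        for (VisemeSample s : samples) {
            ids.add(s.visemeId);
            offsets.add(s.audioOffsetMs);
        }
        return samples.isEmpty() || (ids.size() == 1 && offsets.size() == 1);
    }

    public static void main(String[] args) {
        // Ten events, all "50ms, viseme id 0", mirroring the reported output.
        List<VisemeSample> reported = Collections.nCopies(10, new VisemeSample(50, 0));
        System.out.println(looksDegenerate(reported)); // prints true

        // Hypothetical healthy sequence: IDs and offsets vary.
        List<VisemeSample> healthy = Arrays.asList(
                new VisemeSample(50, 0), new VisemeSample(120, 19), new VisemeSample(210, 4));
        System.out.println(looksDegenerate(healthy)); // prints false
    }
}
```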
Version of the Cognitive Services Speech SDK
1.21.0
Platform, Operating System, and Programming Language