Certain voices not providing viseme durations as expected

Describe the bug

Certain TTS voices do not provide speech marks with viseme timings. For example, all the Urdu Azure TTS voices provide word timings but not the viseme timings that we need.

To Reproduce

Use the following code:

// Speech synthesis word boundary event.
public static void synthesisWordAndVisemeBoundaryEventAsync() throws InterruptedException, ExecutionException {
    // Creates an instance of a speech config with specified
    // subscription key and service region. Replace with your own subscription key
    // and service region (e.g., "westus").
    // The default language is "en-us".
    SpeechConfig config = SpeechConfig.fromSubscription("subscription_key", "region");
    config.setSpeechSynthesisLanguage("ur-PK");
    config.setSpeechSynthesisVoiceName("ur-PK-UzmaNeural");

    // Creates a speech synthesizer with a null output stream.
    // This means the audio output data will not be written to any stream.
    // You can just get the audio from the result.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
    {
        String text = "<speak version=\"1.0\" xml:lang=\"en-US\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"https://www.w3.org/2001/mstts\"><voice name=\"ur-PK-UzmaNeural\"><prosody rate=\"1\">سنو ذرا</prosody></voice></speak>";

        // Subscribes to word boundary event
        synthesizer.WordBoundary.addEventListener((o, e) -> {

            try {
                String word = text.substring((int) e.getTextOffset(), (int) (e.getTextOffset() + e.getWordLength()));
                // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
                // Adding 5,000 ticks first rounds to the nearest millisecond.
                System.out.print("Word boundary event received. Audio offset: " + (e.getAudioOffset() + 5000) / 10000 + "ms, ");
                System.out.println("text offset: " + e.getTextOffset() + ", word length: " + e.getWordLength() + ", word = " + word + ".");
            } catch (Exception ex) {
                System.out.println("Exception = " + ex);
                ex.printStackTrace();
            }
        });

        // Subscribes to viseme received event
        synthesizer.VisemeReceived.addEventListener((o, e) -> {
            // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
            System.out.print("Viseme event received. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
            System.out.println("viseme id: " + e.getVisemeId() + ".");
        });

        // Subscribes to bookmark reached event
        synthesizer.BookmarkReached.addEventListener((o, e) -> {
            // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
            System.out.print("Bookmark reached. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
            System.out.println("bookmark text: " + e.getText() + ".");
        });

        SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(text).get();

        // Checks result.
        if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
            System.out.println("Speech synthesized for text [" + text + "].");
            byte[] audioData = result.getAudioData();
            System.out.println(audioData.length + " bytes of audio data received for text [" + text + "]");
        }
        else if (result.getReason() == ResultReason.Canceled) {
            SpeechSynthesisCancellationDetails cancellation = SpeechSynthesisCancellationDetails.fromResult(result);
            System.out.println("CANCELED: Reason=" + cancellation.getReason());

            if (cancellation.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
                System.out.println("CANCELED: Did you update the subscription info?");
            }
        }

        result.close();
    }

    synthesizer.close();
}

It outputs the following:

Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 50ms, Viseme event received. Audio offset: 50ms, viseme id: 0.
text offset: 175, word length: 3, word = سنو.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 463ms, text offset: 179, word length: 3, word = ذرا.
Speech synthesized for text [سنو ذرا].
55342 bytes of audio data received for text [سنو ذرا]

Expected behavior

We should receive viseme timings for all the visemes that make up the spoken words, rather than repeated events that all report viseme ID 0 at the same 50 ms audio offset.
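
To make the expectation concrete, here is a rough sketch of how we would turn viseme events into durations, and how the output above can be flagged as degenerate. It does not call the Speech SDK. The class `VisemeDurations`, its methods, the `audioEndTicks` parameter, and the sample offsets in `main` are all hypothetical names and values chosen for illustration. It assumes a viseme's duration is the gap between its audio offset and the next viseme's offset, which is our own convention and not something the SDK reports directly.

```java
import java.util.Arrays;

public class VisemeDurations {
    // 1 tick = 100 nanoseconds, so 10,000 ticks = 1 millisecond.
    static final long TICKS_PER_MS = 10_000L;

    // Converts ticks to milliseconds, rounding to the nearest millisecond.
    static long ticksToMs(long ticks) {
        return (ticks + TICKS_PER_MS / 2) / TICKS_PER_MS;
    }

    // Assumption: each viseme lasts until the next viseme starts, and the
    // last viseme lasts until the end of the audio (audioEndTicks).
    static long[] durationsMs(long[] offsetsTicks, long audioEndTicks) {
        long[] durations = new long[offsetsTicks.length];
        for (int i = 0; i < offsetsTicks.length; i++) {
            long next = (i + 1 < offsetsTicks.length) ? offsetsTicks[i + 1] : audioEndTicks;
            durations[i] = ticksToMs(next - offsetsTicks[i]);
        }
        return durations;
    }

    // Returns true when every event reports viseme ID 0 at the same offset,
    // which is the symptom shown in the output of this issue.
    static boolean looksDegenerate(long[] offsetsTicks, int[] visemeIds) {
        if (offsetsTicks.length == 0) {
            return false;
        }
        for (int i = 0; i < offsetsTicks.length; i++) {
            if (visemeIds[i] != 0 || offsetsTicks[i] != offsetsTicks[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shape of the events reported above: ten events, all ID 0 at 50 ms.
        long[] reportedOffsets = new long[10];
        Arrays.fill(reportedOffsets, 500_000L);
        int[] reportedIds = new int[10];
        System.out.println("degenerate: " + looksDegenerate(reportedOffsets, reportedIds));

        // Hypothetical healthy offsets (in ticks), for comparison only.
        long[] healthyOffsets = {500_000L, 1_500_000L, 4_630_000L};
        long[] durations = durationsMs(healthyOffsets, 6_000_000L);
        System.out.println("durations (ms): " + Arrays.toString(durations));
    }
}
```

With the events from this issue, every computed gap between consecutive visemes would be 0 ms, so no usable lip-sync timing can be derived for the ur-PK voices.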

Version of the Cognitive Services Speech SDK

1.21.0

Platform, Operating System, and Programming Language

OS: Mac, Windows, Linux, Android, iOS
Hardware: x64, x86, ARM
Programming language: Java
Browsers: Chrome, Safari, Firefox

Azure-Samples / cognitive-services-speech-sdk

Certain voices not providing viseme durations as expected #2416