Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Certain voices not providing viseme durations as expected #2416

Open trulience opened 3 weeks ago

trulience commented 3 weeks ago

Describe the bug

Certain TTS voices are not providing speech marks with viseme timings. For example, all of the Urdu Azure TTS voices provide word timings but do not provide viseme timings, which are what we need.

To Reproduce

Use the following code:

// Speech synthesis word boundary event.
public static void synthesisWordAndVisemeBoundaryEventAsync() throws InterruptedException, ExecutionException {
    // Creates an instance of a speech config with specified
    // subscription key and service region. Replace with your own subscription key
    // and service region (e.g., "westus").
    // The default language is "en-us".
    SpeechConfig config = SpeechConfig.fromSubscription("subscription_key", "region");
    config.setSpeechSynthesisLanguage("ur-PK");
    config.setSpeechSynthesisVoiceName("ur-PK-UzmaNeural");

    // Creates a speech synthesizer with a null output stream.
    // This means the audio output data will not be written to any stream.
    // You can just get the audio from the result.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
    {
        String text = "<speak version=\"1.0\" xml:lang=\"en-US\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xmlns:mstts=\"https://www.w3.org/2001/mstts\"><voice name=\"ur-PK-UzmaNeural\"><prosody rate=\"1\">سنو ذرا</prosody></voice></speak>";

        // Subscribes to word boundary event
        synthesizer.WordBoundary.addEventListener((o, e) -> {

            try {
                String word = text.substring((int) e.getTextOffset(), (int) (e.getTextOffset() + e.getWordLength()));
                // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
                System.out.print("Word boundary event received. Audio offset: " + (e.getAudioOffset() + 5000) / 10000 + "ms, ");
                System.out.println("text offset: " + e.getTextOffset() + ", word length: " + e.getWordLength() + ", word = " + word + ".");
            } catch (Exception ex) {
                System.out.println("Exception = " + ex);
                ex.printStackTrace();
            }
        });

        // Subscribes to viseme boundary event
        synthesizer.VisemeReceived.addEventListener((o, e) -> {
            // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
            System.out.print("Viseme event received. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
            System.out.println("viseme id: " + e.getVisemeId() + ".");
        });

        // Subscribes to bookmark received event
        synthesizer.BookmarkReached.addEventListener((o, e) -> {
            // The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
            System.out.print("Bookmark reached. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
            System.out.println("bookmark text: " + e.getText() + ".");
        });

        SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(text).get();

        // Checks result.
        if (result.getReason() == ResultReason.SynthesizingAudioCompleted) {
            System.out.println("Speech synthesized for text [" + text + "].");
            byte[] audioData = result.getAudioData();
            System.out.println(audioData.length + " bytes of audio data received for text [" + text + "]");
        }
        else if (result.getReason() == ResultReason.Canceled) {
            SpeechSynthesisCancellationDetails cancellation = SpeechSynthesisCancellationDetails.fromResult(result);
            System.out.println("CANCELED: Reason=" + cancellation.getReason());

            if (cancellation.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + cancellation.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + cancellation.getErrorDetails());
                System.out.println("CANCELED: Did you update the subscription info?");
            }
        }

        result.close();
    }

    synthesizer.close();
}

It outputs the following:

Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 50ms, Viseme event received. Audio offset: 50ms, viseme id: 0.
text offset: 175, word length: 3, word = سنو.
Viseme event received. Audio offset: 50ms, viseme id: 0.
Word boundary event received. Audio offset: 463ms, text offset: 179, word length: 3, word = ذرا.
Speech synthesized for text [سنو ذرا].
55342 bytes of audio data received for text [سنو ذرا]
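For anyone hitting the same symptom: the broken stream is easy to recognize programmatically, since every event carries viseme id 0 at a constant audio offset. Below is a small, hypothetical helper (not part of the Speech SDK sample) that takes the (audioOffsetTicks, visemeId) pairs collected from VisemeReceived events and flags a stream that matches this degenerate pattern:

```java
import java.util.List;

public class VisemeCheck {
    // Returns true when the collected viseme events look like the bug above:
    // no events at all, or every event has viseme id 0 at one fixed offset.
    public static boolean visemeDataLooksMissing(List<long[]> events) {
        if (events.isEmpty()) {
            return true;
        }
        long firstOffset = events.get(0)[0];
        for (long[] e : events) {
            // Any nonzero viseme id, or any offset movement, counts as real data.
            if (e[1] != 0 || e[0] != firstOffset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Pattern reported in this issue: every event is id 0 at 50 ms (500000 ticks).
        List<long[]> urdu = List.of(
            new long[]{500000, 0}, new long[]{500000, 0}, new long[]{500000, 0});
        System.out.println(visemeDataLooksMissing(urdu));    // true

        // A healthy stream has varying ids and advancing offsets.
        List<long[]> healthy = List.of(
            new long[]{500000, 18}, new long[]{1250000, 7});
        System.out.println(visemeDataLooksMissing(healthy)); // false
    }
}
```

In our app we would use a check like this to fall back to word-boundary-based lip sync when a voice gives no usable visemes.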

Expected behavior

We should receive viseme events, with correct timings, for all of the visemes that make up the spoken words.
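One variation worth trying while this is investigated: request viseme output explicitly in the SSML via the mstts:viseme element, which is documented for Azure neural voices. Whether it changes anything for the ur-PK voices is exactly what this issue asks, so treat this as a sketch, not a confirmed workaround. The hypothetical helper below just builds the SSML string; the result would be passed to SpeakSsmlAsync as in the repro above:

```java
public class VisemeSsml {
    // Builds SSML that explicitly asks the service for viseme output.
    // "redlips_front" requests viseme IDs; "FacialExpression" would request
    // blend-shape animation frames instead.
    public static String build(String voice, String text) {
        return "<speak version=\"1.0\" xml:lang=\"en-US\""
            + " xmlns=\"http://www.w3.org/2001/10/synthesis\""
            + " xmlns:mstts=\"https://www.w3.org/2001/mstts\">"
            + "<voice name=\"" + voice + "\">"
            + "<mstts:viseme type=\"redlips_front\"/>"
            + "<prosody rate=\"1\">" + text + "</prosody>"
            + "</voice></speak>";
    }

    public static void main(String[] args) {
        System.out.println(build("ur-PK-UzmaNeural", "سنو ذرا"));
    }
}
```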

Version of the Cognitive Services Speech SDK

1.21.0

Platform, Operating System, and Programming Language

yulin-li commented 2 weeks ago

@LinZhang-Support could you help to check?

trulience commented 2 weeks ago

Thank you! Appreciate your attention to this.