Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Sample bookmark listener from tts Azure documentation not working #1245

Closed pierat closed 2 years ago

pierat commented 3 years ago

Hi,

In a Java application, I am trying to use bookmarks to evaluate audio offsets in a text-to-speech conversion, and even the sample code from the TTS documentation gives incorrect results.

Any idea whether the problem is in my code or whether a limitation applies?

Here is the code:

    // Requires: import com.microsoft.cognitiveservices.speech.*;
    private void test() {
        String speechSubscriptionKey = "38blabla";
        String serviceRegion = "westeurope";
        SpeechConfig config = SpeechConfig.fromSubscription(speechSubscriptionKey, serviceRegion);
        assert (config != null);
        config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz96KBitRateMonoMp3);

        String ssml = "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">\r\n"
                + "    <voice name=\"en-US-AriaNeural\">\r\n"
                + "        We are selling <bookmark mark='flower_1'/>roses and <bookmark mark='flower_2'/>daisies.\r\n"
                + "    </voice>\r\n"
                + "</speak>\r\n";

        SpeechSynthesizer synth = new SpeechSynthesizer(config, null);
        synth.BookmarkReached.addEventListener((o, e) -> {
            // The unit of e.getAudioOffset() is the tick (1 tick = 100 nanoseconds);
            // divide by 10,000 to convert to milliseconds.
            System.out.println(
                    "Bookmark " + e.getText() + " reached. Audio offset: " + e.getAudioOffset() / 10000 + "ms.");
        });

        // Synthesize the SSML (blocking call).
        SpeechSynthesisResult result = synth.SpeakSsml(ssml);
        assert (result != null);

        // Release the native resources held by the result and the synthesizer.
        result.close();
        synth.close();
    }

And here is the result:

Bookmark flower_1 reached. Audio offset: 50ms.
Bookmark flower_2 reached. Audio offset: 50ms.

This is not the expected result: both bookmarks report the same 50 ms offset.

My configuration for this test: Windows 10 with Java JDK 1.8.0_301.
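For reference, the tick arithmetic used in the listener above can be isolated and checked on its own. This is a minimal sketch, assuming only the unit stated in the sample's comment (1 tick = 100 nanoseconds); the class `TickConverter`, its methods, and the offset value in `main` are hypothetical and not part of the Speech SDK.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of the Speech SDK.
final class TickConverter {
    // 1 tick = 100 ns, so 10,000 ticks make one millisecond.
    static final long TICKS_PER_MILLISECOND = 10_000L;

    private TickConverter() {}

    // Whole milliseconds, truncated the same way as the integer
    // division in the sample listener.
    static long ticksToMillis(long ticks) {
        return ticks / TICKS_PER_MILLISECOND;
    }

    // Lossless conversion: Duration keeps the full 100 ns resolution.
    static Duration ticksToDuration(long ticks) {
        return Duration.ofNanos(Math.multiplyExact(ticks, 100L));
    }

    public static void main(String[] args) {
        long ticks = 7_375_000L; // made-up offset value, in ticks
        System.out.println(ticksToMillis(ticks) + " ms");             // 737 ms
        System.out.println(ticksToDuration(ticks).toNanos() + " ns"); // 737500000 ns
    }
}
```

The integer division drops any fraction of a millisecond, so `ticksToDuration` may be the safer choice when offsets are compared at a finer resolution.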

pierat commented 3 years ago

This seems to be a bug, because the code is very simple and matches the samples found in the documentation exactly. Is this feature really implemented in Java?

yulin-li commented 3 years ago

Hi @pierat, I can reproduce this bug and have forwarded it to the service team. I will let you know if I get any updates.

brianarch82 commented 3 years ago

I am also experiencing this bug from .NET.

yulin-li commented 2 years ago

I just synced with the service team; the ETA for fixing this issue is the end of November. Thanks!

alex-uspenskyi commented 2 years ago

I am experiencing the same bug with the Node.js SDK. Will the fix also apply to the Node.js SDK, or do I need to create an issue in https://github.com/microsoft/cognitive-services-speech-sdk-js?

DavidWyand commented 2 years ago

I am experiencing the same issue with the C++ SDK, using Speech SDK 1.19.0 (the 2021-Nov release), which is the most recent as of this message. If it would be preferable for me to open a new issue, please let me know, but this appears to be service-wide rather than specific to one language SDK.

DavidWyand commented 2 years ago

An update on my findings: if you enable viseme generation for a voice (I'm testing with US English), the bookmark timings are correct. Without generated viseme timings, the bookmark timings are incorrect.

pankopon commented 2 years ago

@yulin-li Is the service issue possibly still in effect? Can we assign this to someone in the service team?

yulin-li commented 2 years ago

@pankopon As far as I know, this is in the backlog; I will check the status. I know who the owner is, but I don't know his GitHub handle.

yulin-li commented 2 years ago

Assigned to the owner, @newhillchan.

TalissaDreossi commented 2 years ago

I'm getting a similar error using JavaScript: the events are raised at the very beginning of the synthesis. The audio offsets differ from each other but are still wrong. Are there any updates?

pierat commented 2 years ago

@yulin-li Hi, I'm back on this issue: is there a plan for fixing this bug? It would be really useful for me to know whether we need to continue with a manual workaround or whether we can plan development around this feature.

Thanks!

yulin-li commented 2 years ago

Hi @pierat, thanks for your patience. I just confirmed with the service engineers that the ETA for this issue is 5/31.

vk0novalov commented 2 years ago

Hello @yulin-li, are there any updates on this feature? Thanks!

pankopon commented 2 years ago

I tested this today with the exact code from https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1245#issue-988451704 and Speech SDK 1.22.0, using the northeurope region. The resulting audio offsets match the generated audio (if it is written to a file and viewed, e.g., in Audacity):

Bookmark flower_1 reached. Audio offset: 737ms.
Bookmark flower_2 reached. Audio offset: 1250ms.

So the issue seems to be fixed by now. I will close it soon if there are no other pending items.
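A quick plausibility check on reported offsets can complement inspecting the waveform by eye. The sketch below is only an illustration under stated assumptions: bookmarks separated by spoken text are expected to report strictly increasing offsets, and the audio is 24 kHz as in the output format requested in the original post. The class `BookmarkOffsetCheck` and its methods are hypothetical and not part of the Speech SDK.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the Speech SDK.
final class BookmarkOffsetCheck {
    private BookmarkOffsetCheck() {}

    // True when every offset is strictly greater than the one before it.
    static boolean strictlyIncreasing(List<Long> offsetsMs) {
        for (int i = 1; i < offsetsMs.size(); i++) {
            if (offsetsMs.get(i) <= offsetsMs.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    // Sample index that corresponds to an offset at the given sample rate,
    // useful when an audio editor displays positions in samples.
    static long sampleIndex(long offsetMs, int sampleRateHz) {
        return offsetMs * sampleRateHz / 1000L;
    }

    public static void main(String[] args) {
        List<Long> reported = Arrays.asList(50L, 50L);   // offsets from the original report
        List<Long> retested = Arrays.asList(737L, 1250L); // offsets from the retest
        System.out.println(strictlyIncreasing(reported)); // false
        System.out.println(strictlyIncreasing(retested)); // true
        System.out.println(sampleIndex(737L, 24_000));    // 17688
        System.out.println(sampleIndex(1250L, 24_000));   // 30000
    }
}
```

The check flags the identical 50 ms offsets from the original report and accepts the offsets from the retest. It cannot confirm that an offset lands on the right word, so comparing against the audio remains necessary.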

pankopon commented 2 years ago

Closed as resolved. Please open a new issue if further support is needed.