gexgd0419 / NaturalVoiceSAPIAdapter

Make Azure natural TTS voices accessible to any SAPI 5-compatible application.
MIT License

Word boundary event not working for online voices #16

Closed: PaulBlenkhorn closed this issue 1 month ago

PaulBlenkhorn commented 2 months ago

The word boundary event works well for the offline (Narrator) voices but isn't working properly for the online Edge voices. I'm pretty sure the online voices do send word boundary information as I have this working in an Edge extension that uses the voices directly.

gexgd0419 commented 2 months ago

This engine supports word boundary events for Edge voices. Viseme events are also supported, so that you can see the animated mic in TtsApplication.

Could you tell me which TTS client application and Edge voice you are using, what text you want it to speak, and what isn't working properly, so that I could try to reproduce the problem?

PaulBlenkhorn commented 2 months ago

Many thanks for your very quick response. I am using my own programs, written in C# on Windows 11. I think the easiest way to see the problem is the attached video, which shows the word boundary events failing to synchronise for the Edge voice "Microsoft Clara online" and then working for the voice "Microsoft Jenny". This happens in all of my SAPI-based programs, which are all WinForms but have rather different architectures. The simple example shown sends the boundary information to a WebView control, but other programs do not use the WebView control.

My code captures the word boundary event with:

    ...
    // Subscribe to word boundary (SpeakProgress) events.
    GlobalR.synthesizer.SpeakProgress += new EventHandler<SpeakProgressEventArgs>(synth_SpeakProgress);
    ...

    void synth_SpeakProgress(object sender, SpeakProgressEventArgs e)
    {
        // Forward the character position of the spoken word to the web page.
        int cPosition = e.CharacterPosition;
        string s = ">" + cPosition.ToString();
        Console.WriteLine("Position: " + s);
        this.webView21.CoreWebView2.PostWebMessageAsString(s);
    }

https://github.com/user-attachments/assets/363da80e-8dc8-4e08-a4f4-f148b6e7e3ae

gexgd0419 commented 2 months ago

Confirmed that this happens when using System.Speech.Synthesis.SpeechSynthesizer in C#. But it's still weird that TtsApplication, which is written in C++ and uses the COM API directly, doesn't seem to have this issue.

Also, I found a more serious problem: calling SelectVoice to change the voice to a NaturalVoiceSAPIAdapter voice would often throw an ArgumentException saying "Cannot set voice. No matching voice is installed or the voice was disabled." Does this happen often on your system?
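
A minimal sketch of the repro (the voice name here is just an example; use whatever NaturalVoiceSAPIAdapter voice GetInstalledVoices() lists on your system):

using System;
using System.Speech.Synthesis;

class SelectVoiceRepro
{
    static void Main()
    {
        var synthesizer = new SpeechSynthesizer();
        foreach (var voice in synthesizer.GetInstalledVoices())
            Console.WriteLine(voice.VoiceInfo.Name);

        // Shortly after startup this often throws ArgumentException:
        // "Cannot set voice. No matching voice is installed or the voice was disabled."
        synthesizer.SelectVoice("Microsoft Jenny"); // example name
    }
}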

PaulBlenkhorn commented 2 months ago

I'm glad you can reproduce the problem. It's strange that it is only with the online voices.

Yes, I also have your serious problem (with both offline and online voices). I have found that the software always needs 7-10 seconds on my development machine after first starting before the voice selection works (other SAPI voices do not need this delay). I have worked around the problem by repeatedly trying to initialise the voice until it succeeds, at which point I break out of my loop. Once the voice has been set, the error does not seem to repeat. Here is my C# code that does this:

for (int i = 0; i < 100; i++)
{
    try
    {
        synthesizer.SelectVoice(s);
        // btnLoad.Visible = false;
        Console.WriteLine(s);
        break;
    }
    catch
    {
        // if (i == 0)
        //     ShowLoading();
        Console.WriteLine("Failed " + i);
        await Task.Delay(TimeSpan.FromMilliseconds(500));
    }
}

PaulBlenkhorn commented 2 months ago

I can confirm that the synthesizer.SelectVoice() function is now working well on some of my programs but the delay is still needed on some of the others. I will investigate more and try to identify the issue and give you more information.

(I presume that this fix does not address the word boundary event issue for online voices, which is still not working for me in any of my programs.)

PaulBlenkhorn commented 2 months ago

I have written a small program to show the current issues and a video to show how it works.

In the video:

  1. I start the program and fairly quickly click on Microsoft Jenny; you can see that the program takes several calls to synthesizer.SelectVoice() to initialise the speech. The words are shown as they are spoken.
  2. I click on Microsoft Willem Online and the voice is initialised without issue. However, the word position is incorrect, and you can see a number of errors displayed in the output window of the program.

I hope this is helpful.

https://github.com/user-attachments/assets/c2af21ab-8437-433f-9f42-7a97b8ba2237

PaulBlenkhorn commented 2 months ago

... and the code App.zip

gexgd0419 commented 2 months ago

"I can confirm that the synthesizer.SelectVoice() function is now working well on some of my programs but the delay is still needed on some of the others."

So what version did you use for testing? Did you clone my repo and compile it? Because I haven't released a new version yet.

"I have written a small program to show the current issues and a video to show how it works."

According to the video, the program outputs some websocketpp logs (the [frame_payload] Payload bytes: lines in the debug output). I made websocketpp logging off by default in commit 44d86202f681029b151dfa3ff3576565f7cca299, so you seem to be running an older version.

I tried your program. On my system, with my newest version, the Microsoft Jenny voice could be used with no delay.

PaulBlenkhorn commented 2 months ago

I did clone your repo and rebuild it, but forgot to register NaturalVoiceSAPIAdapter.dll - mea culpa. Now that I have done that, Jenny loads with no delay. Any idea why the online voices are not sending the word boundary event correctly?

(As you are making a tool to make the "Narrator" Natural voices accessible to SAPI, you may be amused to know that I wrote the original Narrator for Microsoft over 20 years ago.)

gexgd0419 commented 2 months ago

It seems that the C# System.Speech module uses its own mechanism to access the COM SAPI voices. Usually, clients create instances of SpVoice objects and let the SAPI framework handle the interactions with the TTS engine. But System.Speech doesn't use SpVoice. Instead, it has a whole set of its own COM interop classes and uses them to interact with TTS engines directly. Although System.Speech tries to replicate the SAPI framework's behavior, the differences in their implementations cause some problems.
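
For comparison, a client can also drive SAPI through the raw COM SpVoice object, so that the real SAPI framework hosts the engine and delivers its events. A minimal late-bound sketch, assuming only the standard SAPI.SpVoice ProgID:

// Sketch: speak through the raw COM SpVoice object (late-bound) so that
// the SAPI framework itself, not System.Speech, hosts the TTS engine.
using System;

class RawSapiCheck
{
    static void Main()
    {
        Type spVoiceType = Type.GetTypeFromProgID("SAPI.SpVoice");
        dynamic spVoice = Activator.CreateInstance(spVoiceType);
        spVoice.Speak("Testing through raw SAPI"); // synchronous by default
    }
}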

For example, this TTS engine sends event information with the correct timestamps during speaking. The SAPI framework respects the timestamps and will deliver the events to the client at the correct times. System.Speech, however, seems to just ignore the timestamps and deliver the events to the client at the moment they are generated.
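
You can observe this from the client side by comparing the wall-clock time at which each SpeakProgress event arrives with the audio position it reports. A small diagnostic sketch (any Edge voice will do):

using System;
using System.Diagnostics;
using System.Speech.Synthesis;

class EventTimingCheck
{
    static void Main()
    {
        var synthesizer = new SpeechSynthesizer();
        var clock = Stopwatch.StartNew();
        synthesizer.SpeakProgress += (s, e) =>
        {
            // With an Edge voice, events tend to arrive well before the audio
            // position they report, because the timestamps are ignored.
            Console.WriteLine($"arrived at {clock.Elapsed}, reported {e.AudioPosition}");
        };
        synthesizer.Speak("This is a short timing test sentence.");
    }
}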

Local Narrator voices use the Azure Speech SDK as the backend, which does the synchronization for us. Edge voices use my own implementation (as they are not supported by the SDK), in which my engine parses the information received from the server. The server sends all event information, with timestamps, first, followed by the actual audio data. My engine immediately passes the received events to the SAPI framework, so all the events arrive before the audio, which I guess may be why the word boundary events (and maybe all events) are out of sync when System.Speech is acting as the SAPI framework.

If my guess is correct, synchronizing the events in the engine myself would fix this issue.
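
Sketched in C# for brevity (the real engine is C++, and these names are purely illustrative), the idea would be something like:

using System;
using System.Collections.Generic;

// Queue events as they arrive from the server, and flush each one only after
// enough audio bytes have been forwarded to reach its timestamp.
class EventSynchronizer
{
    private readonly Queue<(TimeSpan Time, Action Fire)> _pending = new();
    private readonly int _avgBytesPerSec;
    private long _bytesWritten;

    public EventSynchronizer(int avgBytesPerSec) => _avgBytesPerSec = avgBytesPerSec;

    // The server delivers events (with timestamps, in timestamp order)
    // before the audio data.
    public void QueueEvent(TimeSpan timestamp, Action fire) =>
        _pending.Enqueue((timestamp, fire));

    // Call this for every audio chunk forwarded to the SAPI framework.
    public void OnAudioWritten(int byteCount)
    {
        _bytesWritten += byteCount;
        var reached = TimeSpan.FromSeconds((double)_bytesWritten / _avgBytesPerSec);
        while (_pending.Count > 0 && _pending.Peek().Time <= reached)
            _pending.Dequeue().Fire();
    }
}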

For reference, here's part of the implementation of the engine site object in System.Speech, which is passed to SAPI TTS engines so the engine can pass the synthesized audio and events back to the SAPI framework, and then on to the client app.


internal class EngineSite : ITtsEngineSite, ITtsEventSink
{
    // ...
    public void AddEvents([MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] SpeechEventInfo[] events, int ulCount)
    {
        try
        {
            for (int i = 0; i < events.Length; i++)
            {
                SpeechEventInfo sapiEvent = events[i];
                int num = 1 << (int)sapiEvent.EventId;
                if (sapiEvent.EventId == 2 && _eventMapper != null)
                {
                    _eventMapper.FlushEvent();
                }
                if ((num & _eventInterest) != 0)
                {
                    TTSEvent evt = CreateTtsEvent(sapiEvent);
                    if (_eventMapper == null)
                    {
                        AddEvent(evt);
                    }
                    else
                    {
                        _eventMapper.AddEvent(evt);
                    }
                }
            }
        }
        catch (Exception exception)
        {
            _exception = exception;
            _actions |= SPVESACTIONS.SPVES_ABORT;
        }
    }
    // ...
    private TTSEvent CreateTtsEvent(SpeechEventInfo sapiEvent)
    {
        switch ((TtsEventId)sapiEvent.EventId)
        {
        case TtsEventId.Phoneme:
            return TTSEvent.CreatePhonemeEvent(((char)((uint)(int)sapiEvent.Param2 & 0xFFFFu)).ToString() ?? "", ((char)((uint)sapiEvent.Param1 & 0xFFFFu)).ToString() ?? "", TimeSpan.FromMilliseconds(sapiEvent.Param1 >> 16), (SynthesizerEmphasis)((int)sapiEvent.Param2 >>> 16), _prompt, _audio.Duration);
        case TtsEventId.Bookmark:
        {
            string bookmark = Marshal.PtrToStringUni(sapiEvent.Param2);
            return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, bookmark, (uint)sapiEvent.Param1, sapiEvent.Param2);
        }
        default:
            return new TTSEvent((TtsEventId)sapiEvent.EventId, _prompt, null, null, _audio.Duration, _audio.Position, null, (uint)sapiEvent.Param1, sapiEvent.Param2);
        }
    }
}

AddEvents implements ISpEventSink::AddEvents, which is what TTS engines call to tell SAPI about their events. But this implementation just assumes that _audio.Duration is the time position of the event, a value calculated from the number of bytes written so far:


    internal override TimeSpan Duration
    {
        get
        {
            if (_nAvgBytesPerSec == 0)
            {
                return new TimeSpan(0L);
            }
            return new TimeSpan((long)_bytesWritten * 10000000L / _nAvgBytesPerSec);
        }
    }
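
So, for example, with 16-bit mono audio at 24 kHz (_nAvgBytesPerSec = 48000), an event delivered after 96000 bytes have been written gets stamped at 96000 * 10000000 / 48000 = 20,000,000 ticks, i.e. 2 seconds into the stream, regardless of the timestamp the engine actually supplied.
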
PaulBlenkhorn commented 2 months ago

"If my guess is correct, synchronizing the events in the engine myself would fix this issue." Yes, I think that should fix it. The only way I've accessed the Edge voices is through a Chrome/Web extension where the timing of the word boundary event is fine. Here's some of my (javascript) code which is in a background script, but don't think it is that relevant for you. [I was actually sending the information to a C# host using Native Messaging but have abandoned that now as your solution will be much better.) chrome.tts.speak(response.speak, { voiceName: voice, pitch: params.voicePitch, rate: params.voiceRate, volume: params.voiceVolume, requiredEventTypes: ['end', 'word'], onEvent: function (event) { if (event.type === 'end') { console.log("Speech ended."); port.postMessage({ text: "End" }); // port.postMessage({ index: "-1", length: "-1" }); } if (event.type === 'word') { console.log("W: " + event.charIndex.toString(), "L:" + event.length.toString()); port.postMessage({ index: event.charIndex.toString(), length: event.length.toString() }); } } });

"I guess may be the reason why the word boundary events (and maybe all events) are out of sync" - that would seem very plausible.

PaulBlenkhorn commented 2 months ago

FYI: My free web extension can be found here: https://microsoftedge.microsoft.com/addons/detail/readableweb/pfagdimehoadoklbcbheaahkeamhohbp

It is free but not open source. However, if you email me privately at paul.blenkhorn@googlemail.com I will be happy to share any of the code with you.

PaulBlenkhorn commented 2 months ago

That's great. I've downloaded and compiled your new source and that seems to work. I will do some more testing, but I think you are there :)

PaulBlenkhorn commented 1 month ago

I think this is now fixed.

gexgd0419 commented 1 month ago

A new version v0.2 has been released!