Speech Synthesis with the SAPI5 Adapter Sometimes Fails When Using the Microsoft Natural Voices for Narrator

emassey0135 commented 4 months ago

Expected Behavior

I was trying to generate a DAISY3 audiobook with TTS using the Microsoft natural voices for Narrator. I used NaturalVoiceSAPIAdapter to make these voices available to SAPI5, then created a TTS configuration file that sets one of these voices as highest priority.

Actual Behavior

Some sentences failed to be spoken using the voice I chose, and were synthesized successfully using another of these Microsoft natural voices instead. Most sentences were spoken successfully using the correct voice, and it seems very random which sentences failed. The error always says: "Could not speak : speech mutex lock has timedout"

Steps to Reproduce

Install one or more natural voices for Narrator using these instructions.
Download the zip archive from the latest release of NaturalVoiceSAPIAdapter.
Unpack the zip archive into a folder. If you move this folder after installing the SAPI5 voices, you will need to uninstall them and install them again.
Run Installer.exe.
Make sure "Include Narrator natural voices" is checked, and uncheck "Include Microsoft Edge natural voices", since these voices do not support SSML marks and make the SAPI5 tests in the DAISY Pipeline fail.
Press "Install" for both 32-bit and 64-bit, and press "Yes" on the UAC prompts that come up.
Restart the DAISY Pipeline engine if it is already started.
Either create a TTS configuration file specifying one of the Microsoft natural voices and assigning it priority 1 like the one I attached, or from the Pipeline UI, open "Settings" from the "File" menu, go to the "Voices" tab, and check the box for the Microsoft natural voice you want to use and press "Close".
Run the script "dtbook-to-daisy3" with TTS enabled and with the TTS log included. If you are using the CLI, specify the path of the TTS configuration file you created with the "--tts-config" option. If you are using the GUI, this is handled for you if you selected a voice in the settings.

Details

These voices and this SAPI5 adapter seem reliable when I tested them in other applications. I used one for NVDA for a while with no issues, and they also work fine in Book Wizard Producer. I even converted the exact same book in Book Wizard Producer that I tried to convert with DAISY Pipeline, using the same voice, and there were no errors. I also tried changing to a different Microsoft natural voice and got the same error. In addition, I converted the same book to audio with the Microsoft David OneCore voice several times with no issues. Changing org.daisy.pipeline.tts.threads.number or org.daisy.pipeline.tts.encoding.speed did not seem to help. I tried to find other SAPI5 voices to test with, but Eloquence seems not to use SSML marks correctly making DAISY Pipeline refuse to use it, and the Windows version of eSpeak only has a 32-bit installer so Pipeline did not find the eSpeak SAPI5 voices.

Could this issue be caused by the Microsoft natural voices taking longer to speak than the DAISY Pipeline expects, since they definitely take longer than most other SAPI5 voices? Or perhaps the voices or this adapter crashes when multiple threads try to speak at the same time?
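To make the first hypothesis concrete, here is a minimal Python sketch of a timed lock acquisition. It is not the Pipeline's actual code: `speak`, `LOCK_TIMEOUT`, and the sleep durations are hypothetical stand-ins. It only illustrates how a slow voice holding a shared lock could make a waiting thread give up with a timeout, while a fast voice would not.

```python
import threading
import time

speech_lock = threading.Lock()  # hypothetical shared lock guarding the speech engine
LOCK_TIMEOUT = 0.2              # hypothetical timeout, in seconds

def speak(sentence, synthesis_time, results):
    # Try to take the lock, but give up after LOCK_TIMEOUT seconds.
    if not speech_lock.acquire(timeout=LOCK_TIMEOUT):
        results[sentence] = "timed out"
        return
    try:
        time.sleep(synthesis_time)  # stand-in for the real synthesis call
        results[sentence] = "spoken"
    finally:
        speech_lock.release()

def run(synthesis_time):
    # Start two "sentences" that compete for the same lock.
    results = {}
    first = threading.Thread(target=speak, args=("first", synthesis_time, results))
    second = threading.Thread(target=speak, args=("second", synthesis_time, results))
    first.start()
    time.sleep(0.05)  # give the first thread time to take the lock
    second.start()
    first.join()
    second.join()
    return results
```

With a short `synthesis_time` both sentences are spoken; with one longer than `LOCK_TIMEOUT`, the second sentence fails even though nothing crashed.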

Environment

Operating system: Windows 11 64-bit
DAISY Pipeline 2 version: 1.14.16
Interface: command line interface

Logs

Job log file, TTS log, TTS config, and source DTBook

bertfrees commented 4 months ago

@NPavie Any idea what could be wrong here?

NPavie commented 4 months ago

@emassey0135 thanks for the detailed report.

@bert I'm not sure. It might be that the mutex lock we use to work around the Windows 11 multithreaded speech issue is not waiting long enough.

I'll do some tests to reproduce the issue and investigate.

bertfrees commented 3 months ago

@NPavie Thanks.

NPavie commented 2 months ago

Just some quick news on the subject:

We started fixing the Pipeline's internal adapter for SAPI and OneCore voices to address issues with OneCore on Windows 11 (which crashes after recent Windows 11 updates and changes in the Windows Runtime library used to connect to the OneCore API).

We also started testing the natural voices exposed by the NaturalVoiceSAPIAdapter tool, and we encountered some problems with those voices:

The natural voices that can be installed through the Microsoft Store and exposed by the tool treat the underlying SSML differently than the legacy SAPI and OneCore desktop voices. In some cases that we are investigating, the voice does not report the expected number of "marks" that are added by the Pipeline process.
The natural voices are an order of magnitude slower than legacy voices, so an internal timeout in the adapter needs to be changed.
The online "Edge" voices are not reporting marks, so they cannot be used right now within the adapter.
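As an illustration of the first problem, the following Python sketch compares the marks present in an SSML input against the marks a voice actually reported. It is not taken from the Pipeline: the function names are hypothetical, and the list of reported marks is assumed to come from whatever bookmark events the engine delivers.

```python
import xml.etree.ElementTree as ET

def expected_marks(ssml):
    # Collect the name of every <mark> element, ignoring XML namespaces.
    root = ET.fromstring(ssml)
    return [el.get("name") for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "mark"]

def missing_marks(ssml, reported):
    # Return the marks that were in the SSML but never reported by the voice.
    reported_set = set(reported)
    return [name for name in expected_marks(ssml) if name not in reported_set]
```

A non-empty result from `missing_marks` would correspond to the case described above, where the voice reports fewer marks than were inserted.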

emassey0135 commented 2 months ago

@NPavie Thanks for the update. At least the DAISY Pipeline can already use the same voices Edge uses through the Azure speech adapter, although with that you have to pay for the Azure Speech API. The Edge/Azure voices sound a little better to me, but the Narrator natural voices still have the advantage of being faster and being free no matter how much text you convert, and they are still much more natural sounding than probably any other SAPI5 voices, so I'm glad those will probably be able to work well.

emassey0135 commented 2 months ago

@NPavie It seems that in the latest release of NaturalVoiceSAPIAdapter, the Edge online voices now support SSML marks. The Edge voices do not support them directly, but the adapter now simulates SSML marks with word boundary events as described in this issue.

NPavie commented 2 months ago

@emassey0135 interesting! I'll take a look at the latest version of the tool, thanks!

NPavie commented 3 weeks ago

News update on the issue:

I can confirm that with NVSA 0.2, Edge voices are indeed reporting bookmark events.
I have to change how marks are handled by our adapter with a small cheat: since natural voices have issues reporting marks correctly (marks are sometimes ignored if there are quotation characters before or after the mark), I have to fall back to using word boundaries and check between words for the presence of marks in order to report them.

I did some tests with this bypass and so far I did not encounter any issues during synthesis, but I'll run more tests in production just in case.
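The word-boundary fallback can be sketched roughly as follows. This is only an illustration of the idea in Python, not the adapter's code: it assumes that each mark's position is known as a character offset in the plain text and that the voice delivers word-boundary events with character offsets. The function name and data shapes are hypothetical.

```python
def replay_with_marks(marks, word_offsets):
    # marks: list of (name, character offset of the mark in the plain text)
    # word_offsets: character offsets of word-boundary events, in spoken order
    pending = sorted(marks, key=lambda m: m[1])
    events = []
    i = 0
    for offset in word_offsets:
        # Report every mark that sits at or before the word about to be spoken.
        while i < len(pending) and pending[i][1] <= offset:
            events.append(("mark", pending[i][0]))
            i += 1
        events.append(("word", offset))
    # Marks located after the last word are reported at the end of the stream.
    events.extend(("mark", name) for name, _ in pending[i:])
    return events
```

For the text "Hello world. Bye." with words at offsets 0, 6, and 13, a mark at offset 13 is reported just before "Bye." and a mark at offset 17 is reported after the last word, without relying on the voice's own bookmark events.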
