Provide guidance to help developers of voices

ways2read commented 8 months ago

This may be something to move to a different repo since it is not specific to the UI.

There are organizations such as Hear2Read and DAISY Lanka that are developing new voices for languages that are unsupported by big tech. These types of approaches could also be interesting to DAISY members in (for example) the USA and Canada, where there are indigenous languages and few available narrators. The voices developed in India and Sri Lanka are working on Android and could easily be ported to Windows.

For these voices to be leveraged by the Pipeline, what is the recommendation for the API they should provide? Is the OneCore interface extensible to third-party voices? Is there a modern interface for voices that could be good to consider? There appears to be some reluctance to develop a SAPI interface. Perhaps this is considered a dated specification.

Thoughts please?

NPavie commented 7 months ago

Hi @ways2read, I don't have much intel on the OneCore voice creation process, but there was a similar request in the SaveAsDAISY add-in repo: https://github.com/daisy/word-save-as-daisy/issues/34

I'll let @bertfrees or @rdeltour give their thoughts on the API recommendation, but it will probably depend on which platforms the voice should be available on (i.e. if it is Linux or macOS, SAPI/OneCore cannot be recommended).

To quickly summarize how the Pipeline interacts with the SAPI/OneCore APIs: a connector with native binaries (one for SAPI, one for OneCore) was created to look up and use the voices installed in the system registry. Those voices are the ones exposed in the Windows Narrator menu for OneCore voices, or in the older speech synthesis dialog for SAPI ones. The Pipeline prioritizes OneCore voices over SAPI ones, but only for "duplicated" Microsoft voices that exist in both engines (the same voice in both engines, differing only by a suffix on the SAPI voice).
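To illustrate the prioritization rule described above, here is a minimal sketch in Python. It is not the Pipeline's actual connector code: the `Voice` type, the `merge_voices` function, the voice names, and the `" Desktop"` suffix value are all assumptions made for illustration (the comment above only says that the SAPI copy differs "by a suffix").

```python
from dataclasses import dataclass

# Assumption: the suffix marking the SAPI copy of a voice. The thread only
# says the names differ by a suffix, so this value is illustrative.
SAPI_SUFFIX = " Desktop"


@dataclass(frozen=True)
class Voice:
    engine: str  # "onecore" or "sapi"
    name: str


def merge_voices(onecore_names, sapi_names, sapi_suffix=SAPI_SUFFIX):
    """Merge both voice lists, dropping SAPI voices that duplicate a OneCore one.

    A SAPI voice counts as a duplicate only when its name, minus the suffix,
    matches a OneCore voice name. All other SAPI voices are kept.
    """
    onecore = set(onecore_names)
    merged = [Voice("onecore", name) for name in onecore_names]
    for name in sapi_names:
        if sapi_suffix and name.endswith(sapi_suffix):
            base = name[: -len(sapi_suffix)]
            if base in onecore:
                continue  # the OneCore copy takes priority
        merged.append(Voice("sapi", name))
    return merged


# Hypothetical voice names, for illustration only.
voices = merge_voices(
    ["Example Voice A", "Example Voice B"],
    ["Example Voice A Desktop", "Third Party Voice"],
)
```

With these inputs, the SAPI copy of "Example Voice A" is dropped in favor of the OneCore one, while "Third Party Voice" stays available through SAPI because it has no OneCore counterpart.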

Based on some observations made while maintaining the Pipeline's SAPI/OneCore adapter, the structures of Microsoft OneCore and SAPI voices on desktop seem nearly identical, the OneCore desktop TTS engine being a kind of "simplified" version of SAPI, but I'll need to do more research on the subject.

So far in my research, nothing indicates that the OneCore TTS engine is "officially" extensible to third-party vendors. But maybe we could ask Microsoft directly for documentation on voice creation for OneCore (in case there is a more modern interface compatible with OneCore, other than SAPI, that is not publicly promoted), even though I think Microsoft is currently more inclined to promote Azure Text-to-Speech.

For now, the only documentation I have found is for SAPI voice creation, in this whitepaper: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431802(v=vs.85). I did not find much information on voice creation in the newer Microsoft speech runtime documentation (this is the OneCore API): https://learn.microsoft.com/en-us/previous-versions/office/developer/speech-technologies/hh362831(v=office.14)

The discussion could be opened as an issue in the pipeline repository; maybe some other DAISY members have more intel on which speech synthesis APIs exist in other projects and how to extend them?

bertfrees commented 7 months ago

This is what I said to Avneesh and Romain about the subject:

Currently we use the Google Cloud Text-to-Speech REST API directly. We're not using the Java library. So if they would mimic the REST API, their voices could be used in Pipeline. I have done something similar to test our client code: https://github.com/daisy/pipeline-modules/blob/master/tts/tts-mocks/src/main/java/org/daisy/pipeline/tts/mock/impl/MockGoogle.java. I guess another possibility would be to mimic the Azure API. Note however that for Azure we're currently using a Java library, we're not using the REST API directly. Developing a generic REST API for Pipeline seems overkill. But of course, if they are willing to write a Java library that implements our TTS API, that would also be a solution. The benefit of implementing an API such as the Google or Azure one is that it is more widely usable. But it might not be trivial to be fully compatible, including on the SSML level.
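To make the "mimic the REST API" option more concrete, here is a rough sketch in Python of a local service exposing the two Google Cloud Text-to-Speech v1 endpoints, `GET /v1/voices` and `POST /v1/text:synthesize`. This is a sketch under assumptions, not a reference implementation: the voice catalogue and `synthesize_audio` are hypothetical placeholders for a real engine, `audioConfig` is ignored, authentication is omitted, and which request fields and SSML features Pipeline actually relies on has not been checked here.

```python
import base64
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical voice catalogue. The field names follow the response shape of
# the Google `voices.list` method; the values are made up.
VOICES = [
    {
        "languageCodes": ["si-LK"],
        "name": "example-si-LK-voice",
        "ssmlGender": "FEMALE",
        "naturalSampleRateHertz": 22050,
    }
]


def synthesize_audio(content: str, voice_name: str) -> bytes:
    """Placeholder for the real engine: returns silent PCM bytes, not speech."""
    return b"\x00\x00" * 100


def list_voices(language_code=None) -> dict:
    """Build the body of a `GET /v1/voices` response."""
    voices = [
        v for v in VOICES
        if language_code is None or language_code in v["languageCodes"]
    ]
    return {"voices": voices}


def synthesize(request) -> dict:
    """Build the body of a `POST /v1/text:synthesize` response."""
    if not isinstance(request, dict):
        raise ValueError("request body must be a JSON object")
    source = request.get("input") or {}
    content = source.get("ssml") or source.get("text")
    if not content:
        raise ValueError("input.text or input.ssml is required")
    voice = request.get("voice") or {}
    audio = synthesize_audio(content, voice.get("name", VOICES[0]["name"]))
    # The Google API returns the audio as a base64 string in `audioContent`.
    return {"audioContent": base64.b64encode(audio).decode("ascii")}


class Handler(BaseHTTPRequestHandler):
    """Minimal HTTP wrapper around the two functions above."""

    def _send(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path.split("?")[0] == "/v1/voices":
            self._send(200, list_voices())
        else:
            self._send(404, {"error": {"message": "not found"}})

    def do_POST(self):
        if self.path.split("?")[0] != "/v1/text:synthesize":
            self._send(404, {"error": {"message": "not found"}})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = json.loads(self.rfile.read(length) or b"{}")
            self._send(200, synthesize(request))
        except ValueError as exc:  # also covers malformed JSON
            self._send(400, {"error": {"message": str(exc)}})
```

To try it locally, one could start it with `http.server.HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()` and point a client at that address. The hard part remains what is noted above: a real implementation would have to parse and honor the SSML it receives, which this sketch passes straight to the placeholder engine.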

daisy / pipeline

Provide guidance to help developers of voices #773