Add ability to get a "preview" of a voice

bertfrees commented 3 months ago

We discussed the following:

For the UI, we've looked at how other software does it. Usually there is a fixed string that is translated into the correct language and used for generating the sample.
- Google Cloud TTS says: "With cloud machine learning, your application interprets images, texts, and more"
- Siri says: "Hello. My name is Ralph (or whatever)"
- Windows 11 says: "You have selected |voice name| as the default voice"
Sometimes, the sample text can be changed.
- In the older SAPI interface (Windows 7 and before), Microsoft gives users a text box to specify the sample text.
The speech rate should affect the sample.
We should try to handle the case where the user runs through the voices quickly in the "selected voices" table. The experience shouldn't be sluggish.

The approach we're going to try first is that the UI fetches and caches audio files for the voices that the user is most likely to request next. The user might sometimes get a “fetching previews” waiting message, but the experience should speed up as the number of cached previews increases. When the user enters their own text to be used for the preview, it's OK if they have to wait a second.
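The prefetching idea above could be sketched roughly as follows. This is only an illustration, not the UI's actual code: `PreviewCache`, the `fetch` callback and the `radius` setting are hypothetical names, and "most likely to be requested next" is assumed here to mean the rows adjacent to the selected one in the voices table.

```python
from typing import Callable, Dict, List


class PreviewCache:
    """Caches preview audio and warms the cache for neighbouring voices."""

    def __init__(self, voice_ids: List[str], fetch: Callable[[str], bytes], radius: int = 1):
        self.voice_ids = voice_ids  # voice IDs in the order shown in the table
        self.fetch = fetch          # hypothetical: downloads the preview audio for one voice ID
        self.radius = radius        # assumption: number of rows above and below to prefetch
        self.cache: Dict[str, bytes] = {}

    def _ensure(self, voice_id: str) -> bytes:
        # Fetch the preview only if it is not cached yet.
        if voice_id not in self.cache:
            self.cache[voice_id] = self.fetch(voice_id)
        return self.cache[voice_id]

    def select(self, voice_id: str) -> bytes:
        """Return the preview for the selected voice, then prefetch its neighbours."""
        audio = self._ensure(voice_id)
        i = self.voice_ids.index(voice_id)
        lo = max(0, i - self.radius)
        hi = min(len(self.voice_ids), i + self.radius + 1)
        for neighbour in self.voice_ids[lo:hi]:
            self._ensure(neighbour)
        return audio
```

For simplicity the neighbours are fetched synchronously here; a real UI would presumably do this in the background so that the selected preview can start playing immediately.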
The web service interface will be a link to a WAV or MP3 file. The endpoint will be:

http://localhost:8181/ws/voices/$ID/preview?text=foo+bar&speech-rate=120%25

The ID of the voice will be included in the result of the http://localhost:8181/ws/voices endpoint, e.g.:
```
<voices xmlns="http://www.daisy.org/ns/pipeline/data" href="http://localhost:8181/ws/voices">
  <voice engine="espeak" gender="male-adult" lang="en-GB-x-gbclan" name="English_(Lancaster)" id="1"
         preview="http://localhost:8181/ws/voices/1/preview"/>
  <voice engine="espeak" gender="male-adult" lang="en-GB-x-rp" name="English_(Received_Pronunciation)" id="2"
         preview="http://localhost:8181/ws/voices/2/preview"/>
  <voice engine="espeak" gender="male-adult" lang="en-GB-x-gbcwmd" name="English_(West_Midlands)" id="3"
         preview="http://localhost:8181/ws/voices/3/preview"/>
  ...
</voices>
```
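As an illustration of how a client might collect the preview links from such a response, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `preview_links` and the shortened `SAMPLE` document are made up for the example; only the namespace, element and attribute names come from the response above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the /voices response shown above.
NS = {"d": "http://www.daisy.org/ns/pipeline/data"}

# Shortened copy of the example response (two voices only).
SAMPLE = """<voices xmlns="http://www.daisy.org/ns/pipeline/data" href="http://localhost:8181/ws/voices">
  <voice engine="espeak" gender="male-adult" lang="en-GB-x-gbclan" name="English_(Lancaster)" id="1"
         preview="http://localhost:8181/ws/voices/1/preview"/>
  <voice engine="espeak" gender="male-adult" lang="en-GB-x-rp" name="English_(Received_Pronunciation)" id="2"
         preview="http://localhost:8181/ws/voices/2/preview"/>
</voices>"""


def preview_links(xml_text: str) -> dict:
    """Map each voice ID to the preview URL advertised for it."""
    root = ET.fromstring(xml_text)
    return {v.get("id"): v.get("preview") for v in root.findall("d:voice", NS)}


if __name__ == "__main__":
    for voice_id, link in preview_links(SAMPLE).items():
        print(voice_id, link)
```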
The text parameter should probably be optional. If omitted, a stock message in the correct language will be played.

For the TTS engines that support changing the speech rate, the sample is affected by the org.daisy.pipeline.tts.speech-rate property. The "speech-rate" parameter is optional and can be one of the following:
- x-slow, slow, medium, fast, x-fast, or default
- a non-negative percentage, which acts as a multiplier of the default rate
- a number that represents speaking rate in words per minute
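
To make the encoding of the two optional parameters concrete, here is a small sketch that builds a preview URL from a preview link. The helper name `preview_url` is hypothetical; the parameter names `text` and `speech-rate` are the ones described above, and the only library call used is `urllib.parse.urlencode`.

```python
from typing import Optional
from urllib.parse import urlencode


def preview_url(preview_link: str,
                text: Optional[str] = None,
                speech_rate: Optional[str] = None) -> str:
    """Append the optional 'text' and 'speech-rate' query parameters to a preview link."""
    params = {}
    if text is not None:
        params["text"] = text
    if speech_rate is not None:
        # May be a keyword (e.g. "fast"), a percentage (e.g. "120%") or a words-per-minute number.
        params["speech-rate"] = speech_rate
    # urlencode escapes spaces as "+" and "%" as "%25".
    return preview_link + ("?" + urlencode(params) if params else "")


if __name__ == "__main__":
    print(preview_url("http://localhost:8181/ws/voices/1/preview", "foo bar", "120%"))
```

With both parameters omitted, the function returns the preview link unchanged, which corresponds to requesting the stock message at the default rate.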

bertfrees commented 2 months ago

This is done, except for localizing the stock message.

bertfrees commented 2 months ago

Note that it is not possible to post a TTS config file with the /voices/[ID] and /voices/[ID]/preview endpoints. Only the global settings are used. This is a reasonable limitation, because setting properties inside a TTS config file is deprecated, and the rest of the TTS config file cannot influence the voices.

Also note that /voices/[ID] and /voices/[ID]/preview calls need to be preceded by a /voices call. (This is needed to get the preview links.) Settings must not be changed between the last /voices call and the calls that follow it.

This last limitation might not be RESTful, but it was needed to do the caching that makes the whole thing snappy enough. It should be improved later, e.g. by doing the caching per client.

bertfrees commented 1 month ago

Fixed in 2be91f131

daisy / pipeline-modules

Add ability to get a "preview" of a voice #89