Built-in Text To Speech audio generation

5H4D0W-X commented 6 months ago

Is your feature request related to a problem? Please describe.

At the moment, text-to-speech audio cannot be generated locally and in-game. It always has to be downloaded from an external server or requires third-party tools running alongside Resonite. The web download introduces latency that can vary from person to person, and it may be a privacy risk, since sensitive text might be sent to third-party servers. Because everyone in a session has to download the same clip individually, whichever server generates the audio may also face a much greater network load.

Describe the solution you'd like

A component that takes a string input for the text to be spoken (and maybe an enum or string input to select language/voice/tone) and outputs an audio clip plus a boolean that indicates when generation has finished.
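To make the requested inputs and outputs concrete, here is a minimal Python sketch of the component's shape. Everything in it is hypothetical: `TextToSpeechComponent`, `Voice`, and `silent_synthesizer` are illustrative names, not Resonite API, and the placeholder backend returns silence instead of real speech. Generation runs on a background worker, and `update()` (imagined as being called once per frame) publishes the clip and flips the "finished" flag only when the job has completed.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from enum import Enum
from typing import Callable, List, Optional


class Voice(Enum):
    # Hypothetical language/voice selector (the request suggests "maybe an enum").
    DEFAULT = "default"
    EN_US = "en-US"
    DE_DE = "de-DE"


# Assumed pluggable backend: maps (text, voice) to mono PCM samples.
Synthesizer = Callable[[str, Voice], List[float]]


def silent_synthesizer(text: str, voice: Voice, sample_rate: int = 16000) -> List[float]:
    """Placeholder backend: 50 ms of silence per character, no real speech."""
    return [0.0] * (len(text) * sample_rate // 20)


class TextToSpeechComponent:
    """Hypothetical component: string (+ voice) in, audio clip + done flag out."""

    def __init__(self, synthesize: Synthesizer, executor: Executor) -> None:
        self.synthesize = synthesize
        self._executor = executor
        self.text: str = ""                   # input: text to speak
        self.voice: Voice = Voice.DEFAULT     # input: voice/language selection
        self.clip: Optional[List[float]] = None  # output: generated audio clip
        self.is_done: bool = False            # output: True once generation finished
        self._pending: Optional[Future] = None

    def generate(self) -> None:
        # Clear outputs so consumers never treat a stale clip as finished.
        self.clip = None
        self.is_done = False
        self._pending = self._executor.submit(self.synthesize, self.text, self.voice)

    def update(self) -> None:
        # Poll the background job; publish the result once it completes.
        if self._pending is not None and self._pending.done():
            self.clip = self._pending.result()
            self._pending = None
            self.is_done = True
```

Keeping synthesis behind a callable means a local library, an OS voice, or a paid service could be swapped in without changing the component's inputs and outputs.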

Describe alternatives you've considered

As described in the first section: running the synthesis on a server and having each client download it, or running an application in the background that somehow transfers the audio into Resonite (this introduces extreme variations in loading latency, which is undesirable).

Additional Context

I don't know of any specific text-to-speech generators that would be good for Resonite, so suggestions would be appreciated. Quick generation is important for many accessibility applications, such as screen readers (or the Resonite equivalent, see #1458), where short clips would be played as the user interacts with an object.

Requesters

ShadowX (@ shdw_x)

Frooxius commented 6 months ago

There are two approaches for this:
1) We find a good local C# library/API to integrate; however, it needs to be multi-platform.
2) We provision a third-party service; however, this will cost us, so it would likely be a paid feature.

For 1), I haven't done any research of my own. If people have suggestions, that could potentially help.

PJB3005 commented 6 months ago

There is also the option of using operating system built-in TTS support. These are intended for accessibility tools such as screen readers. They're not "high quality", but they are always available on modern operating systems (Windows and macOS; Linux is annoying) and don't cost significant resources to actually render the audio. That said, it wouldn't be consistent between OSes, which may be a problem.

amplified1 commented 6 months ago

OS functions should likely be avoided for the sake of future-proofing: Resonite may eventually want to support standalone headsets, or an OS might change how its TTS works.

Frooxius commented 6 months ago

Yeah, the OS-specific ones aren't really multi-platform. They could potentially be used, but that makes the functionality something you can't rely on always being present or providing consistent results, so we're much less likely to add it in some generic manner.

They also don't tend to be very good in my experience.

shiftyscales commented 6 months ago

Text to speech was one of the requests mentioned in #50.

Yellow-Dog-Man / Resonite-Issues