Open Thaina opened 1 year ago
Came here to suggest this as well. IPA offers an escape hatch for use cases that need a higher level of control.
My use case: I've been experimenting with text-to-speech for a toddler reading app, and I find language-specific behavior too inconsistent. For example, I want pressing a letter to produce the sound the letter makes, but that seems impossible, at least in Greek: the letter's name is spoken instead. Even for entire syllables, I've seen some platforms pronounce the letter names one after another, while others speak the actual syllable.
Since most text-to-speech systems are trained on samples of real languages, IPA input could still take a language as a parameter. That would even enable use cases like "speak English with a French accent": convert the English text to IPA and set the language to French.
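A rough sketch of what that could look like. The `toIpa` helper and the tiny dictionary are stand-ins for a real grapheme-to-phoneme step, and the commented-out `ipa` property on `SpeechSynthesisUtterance` is invented for illustration; no such phoneme input mode exists today:

```javascript
// Toy word-level English-to-IPA dictionary (broad transcriptions).
const englishIpa = {
  hello: "həˈloʊ",
  world: "wɜːld",
};

// Convert an English sentence into a space-separated IPA string,
// falling back to the raw word when it is not in the dictionary.
function toIpa(text, dict) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((word) => dict[word] ?? word)
    .join(" ");
}

const phonemes = toIpa("Hello world", englishIpa);
console.log(phonemes); // "həˈloʊ wɜːld"

// Hypothetical usage: feed IPA to the synthesizer but select a French
// voice, giving "English with a French accent".
// const u = new SpeechSynthesisUtterance();
// u.ipa = phonemes;   // invented property: raw IPA input
// u.lang = "fr-FR";   // existing property: picks the French voice
// speechSynthesis.speak(u);
```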
Why not just allow loading custom-trained voices?
Instead of language-specific voices, we could share one voice across many languages via an IPA specification. Words from any language can be converted into a string of IPA symbols, and the synthesizer can read them out from the same shared voice model. Because many languages share the same pronunciations, this should also reduce the overall size of the voice model data.
We should also have an API for a per-language IPA dictionary, so that text containing several languages can be converted to IPA and synthesized as a single sentence.
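To make the idea concrete, here is a minimal sketch of how per-language dictionaries could compose for mixed-language text. The dictionaries, the segment format, and the `segmentsToIpa` function are all invented for illustration, not an existing API:

```javascript
// Tiny per-language IPA dictionaries standing in for a real lookup API.
const ipaDictionaries = {
  en: { water: "ˈwɔːtər" },
  el: { "νερό": "neˈro" }, // Greek for "water"
};

// Input: a list of { lang, text } segments.
// Output: one IPA string a single shared voice could synthesize.
function segmentsToIpa(segments) {
  return segments
    .map(({ lang, text }) => {
      const dict = ipaDictionaries[lang] || {};
      return text
        .toLowerCase()
        .split(/\s+/)
        .map((word) => dict[word] ?? word) // fall back to raw text
        .join(" ");
    })
    .join(" ");
}

const ipa = segmentsToIpa([
  { lang: "en", text: "water" },
  { lang: "el", text: "νερό" },
]);
console.log(ipa); // "ˈwɔːtər neˈro"
```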