coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Feature request] [TTS] Support SSML in input text #752

Closed: mariusa closed this issue 2 years ago

mariusa commented 3 years ago

Is your feature request related to a problem? Please describe.

For TTS, there is a need to choose a specific model or to send additional data to the engine about how to handle a part of the text.

Describe the solution you'd like

Support SSML / Coqui markup in input text. Examples:

- <tts model="male_voice_1"> Check the box under the tree </tts>
- <tts model="child_voice_1"> This one?   <tts emotion="joy">Wow, it's the Harry Poter lego!</tts>  </tts:model>
nitinthewiz commented 3 years ago

Seconded! SSML would be a very valuable addition to TTS. It would be especially useful for controlling pauses, line breaks, emotion (if possible, using heightened pitch), and urgency (by increasing the speed of spoken text to 1.5x).

It would also be useful in multi-speaker models, where we would give the speaker ID in the SSML itself and Coqui would string them together. Though this would be a stretch goal beyond the basic SSML implementation.

Please let us know how we can help in this. Is there an SSML implementation you need us to research? (like gruut was integrated, perhaps we can integrate an existing SSML framework as well). Is there some code we can contribute?

synesthesiam commented 3 years ago

Which SSML tags/properties do you think would be the most valuable to implement?

nitinthewiz commented 3 years ago

Well, here are the tags that I think would be most relevant (in rough order of relevance):

| Tag | Explanation |
| --- | --- |
| <speak></speak> | Encapsulates the entire SSML section, telling Coqui that this section of the text must be treated as SSML. This also allows us to intersperse SSML and non-SSML text in the same input. |
| <break /> | Inserts a pause in the speech. We can also add time="3s" and other parameters to control how long the break must be. |
| <say-as interpret-as="spell-out"> or <say-as interpret-as="cardinal"></say-as> | Tells Coqui that the enclosed text must be treated specially. One of the things I've noticed with gruut is that it doesn't know how to say capitalized initialisms like USA, CDN, AWS. I haven't tried numbers, but that's what cardinal is for. This can also be extended to currency and the currency's country, so we can do localization; e.g. a million in the US is spoken as 10 lakhs in India. |
| <voice name="Mike"> or <voice id="p235"></voice> | Very useful for multi-speaker models, to specify the voice we want to use, and also (as a stretch goal) to create multi-voice audio, creating the potential for dialogs between voices. |
| <prosody> | Useful for a number of things: setting the volume of the enclosed text, setting the rate of the speech, and setting the pitch so that voices can be made more unique. |
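For reference, a small illustrative snippet (my own example following the W3C spec, not tied to any particular engine) combining these tags might look like:

<speak>
  Please wait. <break time="3s"/>
  Spell this: <say-as interpret-as="spell-out">USA</say-as>.
  <voice name="Mike">
    <prosody rate="slow" pitch="+2st">This part is read slowly by a different voice.</prosody>
  </voice>
</speak>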

There are a lot of implementations and features that are non-standard but can be very useful, such as <emotion name="excited"> or <alias>, which can be used to expand words like element symbols (Mg spoken as Magnesium) or to spell out "Mr.", etc. But those are enhancements that companies have made to their SSML implementations, and we do not necessarily have to follow them.

Sources: the W3C SSML documentation, Amazon's SSML implementation, and Microsoft's SSML implementation.

By the way, I have a question about how SSML is implemented in neural TTS - I do not understand how the SSML tags would be translated to the voice. Would we need to train models which have different pitch, pauses, and volumes? Would we need to train models that know how to pronounce certain words we ask them to spell out (like USA, AWS, ISIS etc)?

Could you help me understand how this would be implemented?

erogol commented 3 years ago

@nitinthewiz thx for the great post. All the use-cases make sense; however, implementing SSML requires a lot of effort. I think we can start by implementing some of the basic functionality and expand it as we go.

I don't know when we can start implementing SSML, but I've added it to our task list here: https://github.com/coqui-ai/TTS/issues/378

When it comes to your question, some basic manipulations (speed, volume, etc.) are straightforward to implement with a single model. However, some need model-level architectural changes or improvements, as you noted: emotions, pitch, and so on.

synesthesiam commented 3 years ago

@erogol I would be interested in starting on this. Some tags can be handled by gruut, such as <say-as>, while others will need to be passed through to TTS.

It may be worth (me) implementing support for PLS lexicons as well, so users could expand gruut's vocabulary.

nitinthewiz commented 3 years ago

@erogol thanks a lot for following up and for the explanation!

@synesthesiam let me know how I can help with the lexicon, or once you've implemented it, we can start contributing to the vocab.

synesthesiam commented 2 years ago

Small update: I've got preliminary SSML functionality in a side branch of gruut now with support for:

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I've gone the extra mile and made full use of the lang attribute, so you can have:

<speak>
  <w lang="en_US">1</w> 
  <w lang="es_ES">1</w>
</speak>

verbalized as "one uno". This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in :frog: TTS with <voice> to generate multi-lingual utterances.

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:

erogol commented 2 years ago

I think the default formats need to be handled by the text normalizer in a way that the model can read. Is this what you also mean, @synesthesiam?

synesthesiam commented 2 years ago

Yes, and also the normalization needs to mirror what the speaker likely did when reading the text. So when gruut comes across "4/1/2021" in text, it needs to come out as the most likely verbalization in the given language/locale.

For U.S. English, "4/1/2021" becomes "April first twenty twenty one". For German, it is "Januar vierte zweitausendeinundzwanzig" instead, which I'm hoping is the right thing to do.

Regarding punctuation, I know that dashes and underscores (and even camelCasing) can be used to break apart English words for the purpose of phonemization -- "ninety" and "nine" are likely in the lexicon, but "ninety-nine" may not be. But this gets more complicated in French: "est-que" is present in the lexicon and is not the same as phonemes("est") + phonemes("que"). So what I'm doing now is checking the lexicon first, and only breaking words apart if they're not present.
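To illustrate the lexicon-first idea (a toy sketch, not gruut's actual code; camelCase splitting is omitted):

import re

def phonemize_word(word, lexicon):
    # Look up the whole word first, so entries like "est-que" win.
    if word in lexicon:
        return lexicon[word]
    # Otherwise fall back to splitting on dashes/underscores and
    # concatenating the phonemes of the known parts.
    parts = [p for p in re.split(r"[-_]", word) if p]
    return [ph for part in parts for ph in lexicon.get(part, [])]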

synesthesiam commented 2 years ago

@erogol It might be worth moving this to a discussion

I've completed my first prototype of :frog: TTS with SSML support (currently here)! I'm using a gruut side branch for now (supported SSML tags).

Now something like this works:

SSML=$(cat << EOF
<speak>
  <s lang="en">123</s>
  <s lang="de">123</s>
  <s lang="es">123</s>
  <s lang="fr">123</s>
  <s lang="nl">123</s>
</speak>
EOF
)

python3 TTS/bin/synthesize.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/es/mai/tacotron2-DDC \
    --extra_model_name tts_models/fr/mai/tacotron2-DDC \
    --extra_model_name tts_models/nl/mai/tacotron2-DDC \
    --text "$SSML" --ssml true --out_path ssml.wav

Which outputs a WAV file with each "123" verbalized in the corresponding language.


Before getting any deeper, I wanted to see if I'm on the right track.

The three main changes I've made are:

  1. Support for multiple TTS models/SSML input in the Synthesizer
  2. Ability to load additional TTS models when running the server.py and synthesize.py scripts (--extra_model_name)
  3. Changes to the web UI and API to support SSML and TTS model selection

Synthesizer

I created a VoiceConfig class that holds the TTS/vocoder models and configs. When creating a Synthesizer, there is now an extra_voices argument that accepts a list of VoiceConfig objects to load in addition to the "default" voice.
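In rough outline (a simplified sketch of the idea rather than the actual prototype code; the field names are illustrative):

from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceConfig:
    # One TTS voice plus its (optional) vocoder.
    tts_checkpoint: str
    tts_config_path: str
    vocoder_checkpoint: Optional[str] = None
    vocoder_config_path: Optional[str] = None
    language: Optional[str] = None
    name: Optional[str] = None

# synthesizer = Synthesizer(..., extra_voices=[VoiceConfig(...), VoiceConfig(...)])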

The Synthesizer.tts method now operates in two modes: when the ssml argument is True, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:

If no voice or language is specified, the default voice is used.
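In pseudocode, the dispatch looks roughly like this (simplified; parse_ssml and resolve_voice stand in for the gruut-based parsing and the voice lookup):

def tts(self, text, ssml=False):
    wavs = []
    if ssml:
        # gruut splits the SSML into sentence objects carrying voice/language info
        for sentence in parse_ssml(text):
            voice = self.resolve_voice(sentence)  # falls back to the default voice
            wavs += voice.synthesize(sentence)
    else:
        wavs += self.default_voice.synthesize(text)
    return wavs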

Command-Line

The server.py and synthesize.py scripts now accept a --extra_model_name argument, which is used to load additional voices by model name:

python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits

The default voice is specified as normal (with --model_name or --model_path). All of the extra voices can (currently) only be loaded by name with their default vocoders.

Additionally, the synthesize.py script accepts a --ssml true argument to tell :frog: TTS that the input text is SSML.

Web UI

(Screenshot: the coqui-ssml web UI)

The two web UI changes are:

erogol commented 2 years ago

@synesthesiam it is a great start to SSML!!!

I think we should also decide how we want to land SSML in the library architecture. Before saying anything, I'd be interested in hearing your opinions about that.

synesthesiam commented 2 years ago

The biggest change so far is SSML being able to reference multiple voices and languages. In the future, <break> tags and prosody will also introduce new challenges.

Architecturally with SSML, text processing, model loading, and synthesis are all tied together at roughly the same abstraction level. Some important questions are:

Is the user (or code) required to pre-load all relevant models, or could it happen dynamically? If dynamic, is the user able to specify a custom model? Perhaps the <voice> tag could be extended to support model/vocoder paths or URIs.

With the proper use of a ThreadPoolExecutor, model loading can be semi-parallelized along with synthesis (I've done something very similar in Larynx already). This is especially useful in a multi-voice context, where synthesis for already-loaded models can proceed during the loading of newly encountered models.
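Something along these lines (illustrative only; load_model and synthesize are placeholders):

from concurrent.futures import ThreadPoolExecutor

def synthesize_all(sentences, models):
    with ThreadPoolExecutor() as pool:
        # Start loading every not-yet-seen model in the background.
        pending = {voice: pool.submit(load_model, voice)
                   for voice in {s.voice for s in sentences}
                   if voice not in models}
        audio = []
        for s in sentences:
            if s.voice in pending:
                # Block only when this sentence actually needs the new model.
                models[s.voice] = pending.pop(s.voice).result()
            audio.append(synthesize(models[s.voice], s.text))
        return audio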

Text processing with str.replace and re.sub is too low-level for SSML, even just operating on the text between tags. Explicit sentence and word boundaries (<s> and <w>) need to be respected, as well as <say-as> and <alias>.

Phonemization is no longer an independent stage either, since the <phoneme> tag can override a word or phrase. gruut goes even further by avoiding post-processing of words that are already in the lexicon. For example, "NASA" is correctly pronounced as /nˈæsə/ whereas NASAA is pronounced like "N", "A", "S", "A", "A".

gruut's TextProcessor constructs a tree from the initial SSML, and then iteratively refines the leaves during each stage of its pipeline. This keeps the overall structure intact, but allows for sentences/words to be moved, tagged, broken apart, or ignored.

Maybe :frog: TTS could plug user-defined functions into this pipeline? They don't have to operate on the whole graph; many of mine just work on a single word at a time. For example, this code converts numbers into words for any language supported by num2words.
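For instance, a single-word handler in that spirit could be as small as (num2words is a real package; the function itself is just an illustration):

from num2words import num2words

def verbalize_number(word, lang="en"):
    # Replace a bare digit token with its spoken form, e.g. "42" -> "forty-two".
    if word.isdigit():
        return num2words(int(word), lang=lang)
    return word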

Depending on where you are in the pipeline, user-defined functions could also operate specifically on numbers, dates, currency, etc. I have code, for instance, that verbalizes numbers as years similar to your code, but done in a (mostly) language-independent manner.


I'll stop before this gets any more long-winded and see what your thoughts are :slightly_smiling_face:

erogol commented 2 years ago

I think users should define not only the language but also the model name and we can load models dynamically. Something like en/tacotron-ddc. Also, we can define default models for each language to be loaded when no model name is defined.
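Something like this lookup (the model names below are ones we already ship; the mapping itself is just an illustration):

DEFAULT_MODELS = {
    "en": "tts_models/en/ljspeech/tacotron2-DDC_ph",
    "de": "tts_models/de/thorsten/tacotron2-DCA",
    "fr": "tts_models/fr/mai/tacotron2-DDC",
}

def resolve_model(lang, model_name=None):
    # Fall back to the per-language default when no model name is given.
    return model_name or DEFAULT_MODELS[lang]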

Threading would be a nice perf improvement too.

I think before we go and solve SSML we need to write up a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit from it or pass it as a class member to the SSMLParser.

My understanding of the SSMLParser is that it parses the given text and returns the text with the metadata (SSML values) alongside it. This metadata is then taken by the Synthesizer, which calls the right set of functions to do TTS, interfacing with the model.

But also the SSMLParser should know what options are available for the chosen model since different models support different sets of SSML tags.

synesthesiam commented 2 years ago

> Also, we can define default models for each language to be loaded when no model name is defined.

A default model for each dataset may be worth it too. So, "ljspeech" could default to whatever the best sounding model is currently.

> I think before we go and solve SSML we need to write up a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit from it or pass it as a class member to the SSMLParser.

This sounds reasonable, though I suspect over time that the Tokenizer and SSMLParser will end up merging. The non-SSML case can always be handled by just escaping the input text (relative to XML) and wrapping it in a <speak> tag. Then you can have a single class with overridable methods for splitting text into words, verbalizing numbers [1], expanding abbreviations, etc.

[1] For example, is "1234" a cardinal number, ordinal number, year, or digits? The Tokenizer would still need this context in order for the SSMLParser to use it properly.
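Coming back to the non-SSML case: the wrapping itself is trivial (a sketch):

from xml.sax.saxutils import escape

def ensure_ssml(text):
    # Escape plain text relative to XML and wrap it in <speak>,
    # so a single SSML code path can handle both kinds of input.
    if text.lstrip().startswith("<speak"):
        return text
    return "<speak>" + escape(text) + "</speak>"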

> But also the SSMLParser should know what options are available for the chosen model since different models support different sets of SSML tags.

Another approach is to parse everything into the metadata and leave it up to the Synthesizer to decide what to ignore. If the Tokenizer/SSMLParser returns Sentence/Word objects with the metadata embedded, this would be straightforward. A word may be marked with <emphasis> in the input text and show up as Word(text="word", emphasis=True), but the Synthesizer can just focus on the text if the underlying model doesn't support emphasis.
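As a sketch of that shape (only text and emphasis come from the example above; the other fields are guesses):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Word:
    text: str
    emphasis: bool = False
    phonemes: Optional[List[str]] = None  # e.g. set by a <phoneme> override

@dataclass
class Sentence:
    words: List[Word] = field(default_factory=list)
    voice: Optional[str] = None
    lang: Optional[str] = None

# A model without emphasis support just reads word.text and ignores the flag.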

erogol commented 2 years ago

When I say Tokenizer, I mean something that the model can also use in training. Since there is no use for SSML in training, it makes sense to use the Tokenizer as the base class, I guess. But I mostly agree with you for inference.

Tokenizer can have preprocess, tokenize, and postprocess steps and we can deal with the contextual information in the preprocess step by providing the right set of preprocessing steps for the selected language.
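A bare-bones outline of that (purely illustrative; names are not final):

class Tokenizer:
    def __init__(self, preprocessors=None):
        # e.g. language-specific normalizers for numbers, dates, abbreviations
        self.preprocessors = preprocessors or []

    def preprocess(self, text):
        for fn in self.preprocessors:
            text = fn(text)
        return text

    def tokenize(self, text):
        return text.split()

    def postprocess(self, tokens):
        return tokens

    def __call__(self, text):
        return self.postprocess(self.tokenize(self.preprocess(text)))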

I don't like the "ignoring" idea, since then the user does not really know what works and what doesn't. To me, defining the available tags based on the selected models makes more sense. But it is also definitely harder than just ignoring. Maybe we should start with ignoring for simplicity.

synesthesiam commented 2 years ago

I'll implement a proof of concept with the Tokenizer idea :+1:

davidak commented 2 years ago

oh no

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

mariusa commented 2 years ago

not stale

lebigsquare commented 2 years ago

<emphasis></emphasis> would be a good one to implement.

https://cloud.google.com/text-to-speech/docs/ssml#emphasis

Is this <emphasis level="moderate">your</emphasis> bag?

erogol commented 2 years ago

The Tokenizer API is a work in progress: https://github.com/coqui-ai/TTS/pull/1079

erogol commented 2 years ago

@synesthesiam any updates?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

davidak commented 2 years ago

No activity does not mean that it is not important anymore.

jeremy-hyde commented 2 years ago

The <mark> tag is also very useful. Google uses it to extract timestamps from the generated audio, which is useful for knowing where things are said.

References: Google doc, W3C ref.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

erogol commented 2 years ago

@WeberJulian 👀

WeberJulian commented 2 years ago

Hey, I started hacking on some basic SSML support using gruut for parsing. For now it's really just a draft, but you can try it out here: https://github.com/coqui-ai/TTS/pull/1452

liaeh commented 1 year ago

@WeberJulian is there an update on adding SSML support to coqui?

Thanks for the info!

s-wel commented 1 year ago

I think this feature is even more relevant now. Consider that, for instance, many future research projects will need alternatives to Google TTS (which is quite strong in SSML) because of the European Union's ambition to strengthen solutions that contribute to trustworthy AI. Can anyone outline what the specific bottleneck for this feature is? What makes it difficult to implement?

thetrebor commented 1 year ago

I really liked where this is going. What are the chances it could get merged with main so we can continue with it?

jav-ed commented 1 year ago

Having the possibility to add user-defined pauses in the speech would be great.

MesumRaza commented 1 year ago

Any update?

erogol commented 1 year ago

Nobody's working on it from the dev team.

Hunanbean commented 1 year ago

For what it is worth, I consider this absolutely essential. I hope very much that this is re-opened and worked on. You quickly hit a brick wall of what you can do without SSML.

vodiylik commented 1 year ago

Any updates? 👀

grabani commented 1 year ago

Still hanging around to see if there is any progress...

jgsaez9 commented 10 months ago

Any update??

genglinxiao commented 10 months ago

Do we have basic support for SSML now? Or is it only supported in the gruut branch?

erogol commented 10 months ago

No, we don't have SSML and there is no timeline for it unless someone contributes it.