AwesomeTTS / awesometts-anki-addon

AwesomeTTS text-to-speech add-on for Anki
GNU General Public License v3.0

Reading Hints #127

Closed ebracho closed 4 years ago

ebracho commented 4 years ago

When working with raw SSML tags, it's possible to indicate the reading of a word in cases where it is ambiguous by using the <sub> tag. For example (using Azure):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="ja-JP">
  <voice name="ja-JP-NanamiNeural">
    <prosody rate="0%" pitch="0%">
      彼女は毎日<sub alias="いちば">市場</sub>に買い物に行きます
    </prosody>
  </voice>
</speak>

indicates that 市場 should be read as いちば rather than the default reading しじょう. As far as I can tell, the <tts> tag interface is meant to work with raw text and strips away any inner HTML, so there's no way to pass this information through to the underlying service. Am I missing something, or is this a known limitation?
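For context, this is roughly what the add-on would ultimately need to hand to Azure for the reading hint to survive. A minimal sketch using the azure-cognitiveservices-speech Python SDK, completely separate from AwesomeTTS; the key, region, and output file name are placeholders:

import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and output path; not part of AwesomeTTS.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioOutputConfig(filename="ichiba.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="ja-JP">
  <voice name="ja-JP-NanamiNeural">
    <prosody rate="0%" pitch="0%">
      彼女は毎日<sub alias="いちば">市場</sub>に買い物に行きます
    </prosody>
  </voice>
</speak>
"""

# speak_ssml_async sends the markup verbatim, so the <sub> alias is honored.
result = synthesizer.speak_ssml_async(ssml).get()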

ebracho commented 4 years ago

Proof of concept: https://github.com/ebracho/awesometts-anki-addon/commit/faea6240ef9ecf66424977892b49f72e45b1a679

This is by no means a complete solution (removing that HTML sanitizer rule probably breaks other services), but it achieves the behavior I'm after, specifically for the Azure service, by passing unsanitized <sub> elements inside a <tts> element through to the service. Ex:

<tts preset="my-azure-preset">彼女は毎日<sub alias="いちば">市場</sub>に買い物に行きます</tts>

A cleaner approach would probably be to replace the HTML filter with one that lets only a small subset of tags through, perhaps configurable as a Service trait or as an option attribute on the <tts> element.
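Roughly what I have in mind (a sketch only; the function name is made up and this is not how the add-on's sanitizer is currently written), using BeautifulSoup to unwrap every element except an allow-list, so stripped tags still contribute their text:

from bs4 import BeautifulSoup

ALLOWED_TAGS = {"sub"}  # hypothetical per-service or per-preset allow-list

def filter_html(fragment, allowed=ALLOWED_TAGS):
    """Drop all markup except allow-listed tags, keeping the inner text."""
    soup = BeautifulSoup(fragment, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()  # remove the tag itself but keep its children/text
    return str(soup)

print(filter_html('彼女は毎日<sub alias="いちば">市場</sub>に<b>買い物</b>に行きます'))
# 彼女は毎日<sub alias="いちば">市場</sub>に買い物に行きます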

Another issue is that the Anki HTML renderer interprets <sub> as subscript, which makes the text display oddly. I'm not sure what can be done about that.

luc-vocab commented 4 years ago

Providing access to SSML tags would open up a lot of possibilities for some voices (https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp). Just to confirm, are you planning on batch-generating audio files?

Can you elaborate on what your use case is? Is it a matter of correcting the pronunciation in cases where the default is not suitable?

As for the display issue, you'll need one field for display and another field for the SSML string. AwesomeTTS users already do this in cases where the string passed to the TTS engine needs to differ from what is shown on the card.
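For example (field names are made up here, and this assumes the <tts preset> element from your example can wrap a field reference in the card template), the template shows one field and reads the other:

{{Expression}}
<tts preset="my-azure-preset">{{ExpressionSSML}}</tts>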

ebracho commented 4 years ago

Just to confirm, are you planning on batch-generating audio files?

Not for now - I'm just getting started with sentence mining for my first deck and so far I've just been creating cards manually.

Can you elaborate on what your use case is? Is it a matter of correcting the pronunciation in cases where the default is not suitable?

Yes, exactly. In Japanese specifically, the pronunciation of a phrase is often ambiguous and requires context to interpret correctly. For example, the phrase 「今日は」 is most commonly read as "today" ("kyō wa"), but in some formal writing contexts it can be read as "hello" ("kon'nichiwa"). This article (Optimizing Japanese text-to-speech with Amazon Polly) goes into more detail about the challenges of synthesizing Japanese speech and the solutions that AWS Polly's dialect of SSML offers for dealing with them.
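If I'm remembering the article correctly, Polly handles this case with a phoneme tag that accepts a yomigana reading, along these lines (shown only for comparison; AwesomeTTS doesn't pass this through today):

<speak>
  <phoneme alphabet="x-amazon-yomigana" ph="こんにちは">今日は</phoneme>、お元気ですか
</speak>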

As for the display issue, you'll need one field for display and another field for the SSML string.

Oh that should work perfectly, thanks!

luc-vocab commented 4 years ago

Also, can you solve your problem by having two fields: one Japanese field, which is the true written form and the one you'll see displayed, and one "japanese-tts-pronunciation" field, which is the one fed to the TTS engine? That's how I fix TTS issues with Cantonese.

ebracho commented 4 years ago

Also, can you solve your problem by having two fields: one Japanese field, which is the true written form and the one you'll see displayed, and one "japanese-tts-pronunciation" field, which is the one fed to the TTS engine? That's how I fix TTS issues with Cantonese.

For my use case, I think I need to be able to pass those pronunciation hints through to Azure regardless of whether they're rendered on the card. I was able to unblock myself by forking and patching, though, so feel free to close this issue!

luc-vocab commented 4 years ago

My question was: can your problem be solved by having a secondary field which contains the string to be fed to the TTS engine? Or is SSML a must in your case? If your patch is widely usable, feel free to submit it.

ebracho commented 4 years ago

My question was: can your problem be solved by having a secondary field which contains the string to be fed to the TTS engine? Or is SSML a must in your case? If your patch is widely usable, feel free to submit it.

Ah, I see what you meant now. I think a secondary field with the phonetic version of the word would also work. My patch is definitely not widely usable; I just deleted the HTML sanitization rule without considering how it would affect other parts of the code.

luc-vocab commented 4 years ago

Closing for now as the immediate problem seems fixed.