MycroftAI / mycroft-core

Mycroft Core, the Mycroft Artificial Intelligence platform.
https://mycroft.ai
Apache License 2.0

RemoteTTS "sentence" splitting issues #3132

Open emphasize opened 1 year ago

emphasize commented 1 year ago

https://github.com/MycroftAI/mycroft-core/blob/5f964f7d4036210c7e1977e6903d3f13421e2af2/mycroft/tts/remote_tts.py#L47-L48
https://github.com/MycroftAI/mycroft-core/blob/5f964f7d4036210c7e1977e6903d3f13421e2af2/mycroft/tts/remote_tts.py#L60-L65

The splitter in its current form is suboptimal: phrases that aren't touched by Lingua-franca format functions (e.g. "Die aktuelle Uhrzeit ist 22:20:38 Uhr. Mitteleuropäische Sommerzeit, Dienstag, 27. September 2022.", German for "The current time is 22:20:38. Central European Summer Time, Tuesday, 27 September 2022.") are split into ["Die aktuelle Uhrzeit ist 22:20:38 Uhr", "Mitteleuropäische Sommerzeit, Dienstag, 27", "September 2022"], because the regex treats the period after the ordinal "27." as a sentence boundary.
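To reproduce, here is the split logic from `_get_phrases` applied to the sample text above (regex and post-processing copied from the linked lines):

```python
import re

sentence = ("Die aktuelle Uhrzeit ist 22:20:38 Uhr. "
            "Mitteleuropäische Sommerzeit, Dienstag, 27. September 2022.")

# Regex from RemoteTTS._get_phrases: one or more dots followed by one
# character from the class [\s+|\n] (note: inside [] the '+' and '|' are
# literals, another sign the pattern is fragile).
phrases = re.split(r'\.+[\s+|\n]', sentence)
phrases = [p.replace('\n', '').strip() for p in phrases]
phrases = [p for p in phrases if len(p) > 0]
# The ordinal "27." is matched as a sentence boundary, cutting the date in half.
```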

Please make it/them accessible to classes subclassing RemoteTTS (if it has to stay in the current form), for instance:

class Sub(RemoteTTS):

    @staticmethod
    def _get_phrases(sentence):
        return [sentence]
JarbasAl commented 1 year ago

I deprecated this class in a backwards-compatible way in OPM; I removed the split and am using requests directly:

https://github.com/OpenVoiceOS/OVOS-plugin-manager/blob/dev/ovos_plugin_manager/templates/tts.py#L984
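For context, a minimal sketch of the "requests directly" approach; class name, endpoint, and parameters here are illustrative, not the actual OPM code:

```python
import requests


class SimpleRemoteTTS:
    """Illustrative only: one HTTP request per full utterance, no splitting."""

    def __init__(self, url, voice="default"):
        self.url = url
        self.voice = voice

    def build_params(self, sentence):
        # The server receives the whole utterance as synthesis context.
        return {"text": sentence, "voice": self.voice}

    def get_tts(self, sentence, wav_file):
        # Mirrors the (audio_file, phonemes) return shape of the TTS template.
        response = requests.get(self.url, params=self.build_params(sentence),
                                timeout=30)
        response.raise_for_status()
        with open(wav_file, "wb") as f:
            f.write(response.content)
        return wav_file, None
```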

krisgesling commented 1 year ago

Yeah I can't see any reason that a subclass shouldn't be able to override it. Are you able to do a little PR for that?

But Jarbas raises a good question, should we allow each plugin to split sentences in its own way or perform it consistently at a higher level?

I think we should consider that independently of this particular (rather rudimentary) implementation of sentence splitting. Though something tells me that "sentences" are not universal enough to handle it in a single way.

The first thing that comes to mind are Spanish questions "¿Cómo te llamas?", but I'm sure there would be other examples that aren't as easy to account for.

emphasize commented 1 year ago

There are some additional problems with RemoteTTS: it essentially bypasses caching and the queue (consequently making pulse_duck: True ineffective).

emphasize commented 1 year ago

In RemoteTTS, adding/rewriting

from .cache import hash_sentence

def execute(self, sentence, ident=None, listen=False):
    # for part in self._get_phrases(sentence): ## gets tossed
    response = self._request(sentence).result()
    if response.status_code == 200:
        input_file = self.__cache(sentence, response.content)
        viseme = None
        self.queue.put((self.audio_ext, input_file, viseme, ident, listen))
    else:
        LOG.error('%s Http Error: %s for url: %s' %
                  (response.status_code, response.reason, response.url))

def __cache(self, sentence, data):
    if self.config.get("cache", True):  # let configs opt out
        sentence_hash = hash_sentence(sentence)
        file = str(self.cache.define_audio_file(sentence_hash).path)
        self.cache.cached_sentences[sentence_hash] = (file, None)
    else:
        file = self.filename

    with open(file, 'wb') as f:
        f.write(data)

    return file

should do the trick.
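For reference, hash_sentence in mycroft.tts.cache just derives a deterministic cache key from the text, roughly like this (an approximation from memory; check the actual module):

```python
import hashlib


def hash_sentence(sentence: str) -> str:
    # Approximation of mycroft.tts.cache.hash_sentence: identical text
    # always maps to the same key, hence the same cached audio file.
    return hashlib.md5(sentence.encode("utf-8", "replace")).hexdigest()
```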

emphasize commented 1 year ago

Maybe we should make split_in_sentences a config option too? (Why is it a staticmethod atm, and where is it used as such?)

def _get_phrases(self, sentence):
    if self.config.get("split_in_sentences", True):
        phrases = re.split(r'\.+[\s+|\n]', sentence)
        phrases = [p.replace('\n', '').strip() for p in phrases]
        phrases = [p for p in phrases if len(p) > 0]
    else:
        phrases = [sentence]

    return phrases

I could do this in the coqui plugin, but it could be universal and thus require less code.

Splitting up might give a better experience above a certain RTF (real-time factor).
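As a standalone sketch of the config-gated splitter above (hypothetical free function; `config` stands in for self.config so it can be exercised outside the class):

```python
import re


def get_phrases(sentence, config=None):
    # Config-gated variant of _get_phrases: splitting is on by default
    # and can be disabled via "split_in_sentences": False.
    config = config or {}
    if config.get("split_in_sentences", True):
        phrases = re.split(r'\.+[\s+|\n]', sentence)
        phrases = [p.replace('\n', '').strip() for p in phrases]
        return [p for p in phrases if p]
    return [sentence]
```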

emphasize commented 1 year ago

... should we allow each plugin to split sentences in its own way or perform it consistently at a higher level?

If you're running anything other than "enclosure": "picroft" and the string is not SSML, this is already the case. I just wonder why the config key is ignored.

JarbasAl commented 1 year ago

There is a split hardcoded in core for picroft, a split hardcoded in RemoteTTS, and mimic2 has its own split logic as well.

I think this split will be very plugin-specific and is not one of the things that should be handled in a one-size-fits-all fashion; my recommendation is to allow each plugin to split things if and how it wants, not to enforce the split.

In OVOS I removed all this split logic; it never helped in practice and often caused weirdness. As an anecdotal example, splitting has completely changed pronunciations in some cases, simply because the TTS performs better with the full utterance as context.

forslund commented 1 year ago

Totally agree with @JarbasAl; that was the idea behind putting the split in .preprocess_utterance(), so any plugin can override it as needed (while still providing some sort of sane(R) default that should(TM) be usable in most cases).

The initial sentence splitting was added to increase responsiveness for mimic1 (processing one sentence at a time was faster than synthesizing a whole Wikipedia article). Then audio would crash on picroft if segments were too long, so additional splitting was added there too (I think that has since been resolved on the platform and removed from Mycroft). After that, Mimic2 had issues with too-long text and needed extra utterance processing.

I do think sentence splitting for increased responsiveness is generally a good thing (at least for the TTS architectures I've seen, where audio is generated as a final step), but as Jarbas says, it's a tradeoff with quality...

emphasize commented 1 year ago

We're getting closer to the root of this. So the check should be (as indicated by the TODO):

if tts.config.get("split_in_sentences", True): 
    chunks = tts.preprocess_utterance(utterance)
    ...

with RemoteTTS dropping

def _get_phrases(...)

and related code, and the respective plugin overriding preprocess_utterance() if it comes up with/needs something special. Still, I would make the default splitter part of mycroft.util.x
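The override pattern would then look something like this; BaseTTS stands in for the real TTS base class, and the default split shown is a placeholder, not the actual mycroft implementation:

```python
class BaseTTS:
    # Stand-in for mycroft.tts.TTS; only the relevant hook is shown.
    def preprocess_utterance(self, utterance):
        # Placeholder "sane default" splitter.
        return [p.strip() for p in utterance.split('. ') if p.strip()]


class NoSplitTTS(BaseTTS):
    # A plugin that prefers the full utterance as synthesis context.
    def preprocess_utterance(self, utterance):
        return [utterance]
```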