Thanks for the feedback, and we totally agree; it's driving much of our current work. We are working on several things:

1) The current TTS engine, Mimic, is very good from a privacy standpoint in that it can work completely locally, even on Raspberry Pi hardware. However, it runs at about 1:1 speed, meaning that a 5 second utterance takes 5 seconds to generate -- meaning 5 seconds of silence in the meantime. Our work on Mimic2 will provide an off-device solution that can be significantly more natural AND faster. The Mark II potentially will also be able to run this software on the local FPGA (no promises about that yet). Together, these alone should improve the experience.

2) The current STT mechanism is "batched", not "streaming". DeepSpeech is beginning to add support for streaming, which we can take advantage of to further reduce the response time (see the sketch below).
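To make the batched vs. streaming distinction concrete, here is a minimal sketch. The `stt` and `mic` objects are hypothetical placeholders, not the actual mycroft-core or DeepSpeech APIs; the point is that streaming overlaps decoding with the user's speech instead of starting only after the utterance ends.

```python
# Batched: record the whole utterance, then transcribe it in a single call.
# The user waits through both the recording AND the full decode.
def transcribe_batched(stt, mic):
    audio = mic.record_until_silence()   # e.g. 5 seconds of speech
    return stt.decode(audio)             # decoding starts only now

# Streaming: feed small audio chunks as they arrive, so decoding happens
# while the user is still talking. When silence is detected, only the
# tail end of the work remains.
def transcribe_streaming(stt, mic):
    stream = stt.create_stream()
    for chunk in mic.chunks():           # small buffers, e.g. ~20 ms each
        stream.feed(chunk)
        if mic.detected_silence():
            break
    return stream.finish()
```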
As for Python vs other languages -- the Mimic system is already C and highly optimized. The STT systems are also running on GPU hardware and such via TensorFlow, which is highly optimized. The Python components aren't a source of delays, so switching languages would do little to no good.
As for your rant, I respectfully disagree. First, as I pointed out above, the Mark II adds processing capabilities that the Mark I didn't have, opening up potential for voice interaction. Additionally, when I started building a voice agent some time ago I rapidly discovered that a sizable portion of the interactions DO benefit from or simply need a screen in order to convey information. The current weather is good verbally, but a 10 day forecast requires a visual; cooking timers are really handy to control verbally, but a visual tracker is really useful for seeing how much time is left at a glance; you don't want to stop the music to see what song is playing; etc.
However, I do agree that I don't want more screens when they aren't needed. We are working on technology and partnerships with others who are already connected to larger screens (such as TVs or laptops).
I hope this doesn't sound dismissive -- your thoughts aren't far from ours! We'd love to have help in this if you'd like to join in the development efforts.
> As for your rant, I respectfully disagree.
I knew it came off kind of ranty. I think as long as the vocal response times are improved things would be a LOT better (regardless of hardware implementation). The biggest source of the rant for me was "why are they working on screens when the core voice response is so slow".
I really do love the idea of Mycroft being open and extensible. I was an early adopter of the Chumby, and just don't want to see Mycroft go down the same path :-)
I noticed the other day, our Amazon Echo took some big AI steps recently and can respond to questions almost as well as Google's voice assistant.
The Pi is a terrific cheap computing device, but it's just not high performance. If you want more performance, try running Mycroft on beefier hardware. Using default everything from a new install, it takes 4-6 seconds from recognition to completing a response on my desktop (Xeon, 16 GB). A Google Home is perhaps a second faster for the same questions. (edited: after log review, as low as 4s)
First, no worries about ranting -- it didn't come across as mean and I took it for what I think you intended.
Fortunately/unfortunately, this isn't the sort of problem where I can just focus thousands of software engineers on one part and come out with a solution in a few weeks. This is a big problem, and we are working with limited resources (I don't have thousands of engineers who work directly for me) and disparate groups running on their own timelines. So some parts are improving at a faster pace than others.
Also, the GUI piece is actually easier and will appear to develop faster since it involves well-known challenges -- computers have had screen interfaces for the last 40 or 50 years and we are pretty good at abstracting out things like windows, buttons, dialogs, etc. and defining output with things like HTML/CSS that is rendered with existing tools like WebKit.
P.S. I'm working to make Mycroft a technology that is capable of living on far longer than Chumby, and hopefully longer than me, too!
I notice that Mycroft is very fast to respond when the response does not involve speech. When it involves both speech and non-speech output, it will execute the latter very quickly but the former with a 5-10 second delay, which makes any speech interaction useless.
The weather skill on the Mark 1 is a good example. It will instantly display the weather on the mouth (non-speech) but will take a while to speak it.
I don't think this is a problem with the Raspberry Pi's specs, but rather with the time it takes to construct its speech.
A workaround could be to cache common responses (e.g. static parts of *.dialog files in skills) by default, and cache more and more of its responses (and dynamic parts in the .dialog files) as the user interacts with it more.
The Text to Speech (TTS) output is actually cached at several levels. On an individual device, every spoken phrase is cached in temporary storage and will be reused if it is called for again. This cache is limited, however, and it is cleared on each reboot.
Additionally, the server-generated voices (such as the "American Male" aka "Kusal" voice) have a global cache. So once an individual user has requested a generated version of "It is 7:38", any other user who requests that same phrase will get the cached version -- skipping the GPU generation phase. So the delay is just the network transfer. The odds of hitting the global cache will get better as Mycroft adoption increases.
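For anyone curious, here is a minimal sketch of how a two-level phrase cache like this can work. The names (`CACHE_DIR`, `synthesize_remote`, the MD5 keying) are illustrative assumptions, not the actual mycroft-core implementation:

```python
import hashlib
import os

CACHE_DIR = "/tmp/mycroft/tts_cache"  # hypothetical local cache location


def cache_key(text, voice):
    """Key the cache on the exact phrase and the voice that renders it."""
    return hashlib.md5(f"{voice}:{text}".encode("utf-8")).hexdigest()


def get_tts_audio(text, voice, synthesize_remote):
    """Return a path to audio for `text`, reusing cached audio when possible.

    `synthesize_remote` stands in for the server call, which has its own
    global cache keyed the same way, so a miss here may still be a hit there.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_key(text, voice) + ".wav")
    if os.path.isfile(path):                 # local hit: no network, no GPU
        return path
    audio = synthesize_remote(text, voice)   # may be served from the global cache
    with open(path, "wb") as f:              # store locally for next time (until reboot)
        f.write(audio)
    return path
```

Keying on both the phrase and the voice is what allows the global cache to be shared across users: identical requests map to identical audio.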
Hi there, speeding up Mycroft's response times is a continuing effort.
As this issue hasn't had a response for a while, it will be closed.
Thanks again for your feedback.
The slow vocal response speed is a huge blocker to daily use of Mycroft. Queries normally take 8+ seconds before any initial response... then additional responses filling the request can take 20+ seconds.
If the problem is Python, please adopt another, faster compiled language for mycroft-core (golang, rust, etc.). If the problem is elsewhere, such as the TTS engine, then let's fix that. A 3-4 second initial turn-around time is the max for a reasonable conversation.
<random rant> The new Mycroft hardware seems to put more focus on using an LCD... most people don't want "another screen". The whole point of a vocal assistant is the vocal part. I don't want to see Mycroft become another failed Chumby. </random rant>