Closed betterclever closed 7 years ago
@jyothiraditya @dynamitechetan @dravitlochan What do you think about a server-based implementation, and should we go ahead with it?
Models for speech processing would need to be trained, and we currently have a lot of open areas to solve. This might be a feature for the future, but not right now.
@mariobehling I think there is some misunderstanding. This issue is not about training speech models or building our own speech synthesis engine; that is neither easy nor our goal. The point is that calling the Google Speech API, then the Susi Chat API, then the Watson API slows things down on a slow network when it is all done on the client (a Raspberry Pi, for example). The idea is to move this to the server: the client sends the recorded voice to the server, the server calls the Google Speech API there to get the text output, generates the Susi response, does TTS by calling the Watson API, and sends the response (a .wav file to be played) back to the device. Since the server runs in a cloud environment with a high-speed connection and a moderately powerful CPU, time is saved, and any Susi client can take advantage of it.
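For illustration, a rough sketch of what the client side could look like under this approach. The `/susi/speech` endpoint, its parameters, and the audio format are assumptions, not an existing API:

```python
# Minimal sketch of the proposed single-round-trip client.
# Assumption: a hypothetical /susi/speech endpoint on susi_server that
# accepts a raw voice recording and returns a ready-to-play .wav answer.
import requests

def ask_susi_via_server(recorded_audio: bytes, server="https://api.susi.ai"):
    # One request carries the recording; the server is expected to do
    # STT (Google), query the Susi Chat API, and TTS (Watson) itself.
    response = requests.post(
        f"{server}/susi/speech",           # hypothetical endpoint
        data=recorded_audio,
        headers={"Content-Type": "audio/wav"},
        timeout=30,
    )
    response.raise_for_status()
    return response.content                # .wav bytes, ready to play

# Usage (audio capture and playback are left to the client platform):
# wav = ask_susi_via_server(open("question.wav", "rb").read())
# open("answer.wav", "wb").write(wav)
```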
This is doable, and I'll soon implement it in my custom deployment to test whether it helps. Is there any problem with this approach?
ok, interesting idea, but if this is 'just' about sending the speech to a server for STT, then it does not matter whether you send it from the client directly to google/watson etc. or to the susi_server back-end. This data has to be sent anyway, right?
And if you send the audio first to susi_server, then we have the additional task of forwarding the audio from susi_server to google/watson etc. That creates more load. Did I miss the point?
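A rough back-of-envelope comparison of the two flows (every number below is an assumption, purely to frame the trade-off):

```python
# Back-of-envelope latency comparison; all numbers here are assumptions.
client_rtt = 0.4   # seconds per round trip on a slow home connection
cloud_rtt = 0.05   # seconds per round trip from a cloud data centre

# Current approach: the client makes three sequential round trips
# (Google STT, Susi Chat API, Watson TTS).
current = 3 * client_rtt                      # 1.20 s

# Proposed approach: one client round trip, plus the server making the
# same three calls over its fast link (the "add-on" forwarding load).
proposed = 1 * client_rtt + 3 * cloud_rtt     # 0.55 s

print(f"client-side calls: {current:.2f}s, server-side relay: {proposed:.2f}s")
```

The relay does add three server-side calls (more load on susi_server) and only wins when the client's link is much slower than the server's, which is exactly the trade-off discussed above.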
No, you are absolutely right. The connection at my home sometimes takes a long time resolving connections, so when I was performing frequent calls I saw a slowdown, since multiple requests are needed from the client side. I was therefore wondering whether it could be improved by shifting everything to the server. I was not sure it would help, but my thinking was that we would only need to send data to the server once and get data back once. The server doesn't face many connectivity issues, and when I tried it on Google Cloud Shell, both TTS and STT were almost instantaneous (on a 5-second voice sample), so no major delays there. But I think it won't be of much help, since we would also need to handle other issues in such a setup, like proper syncing and re-sending the audio in case of failure, and as you said, it should not improve speed much (if at all) and may even introduce delays. So I think we should stick to the current approach. Closing this, therefore.
Currently the app is very slow. It takes considerable time between speech input and output. I dug in to find the cause and observed the same pattern on Alexa. The main cause of the slowdown is that we handle things one by one: first we record the voice, then send it to Google for recognition; after getting the hypothesis text, we send it to the Susi server, which replies with text; and then we synthesize speech either on the device or by sending it to Watson for synthesis, which can slow things down even more.
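For illustration, the current flow looks roughly like this (all helpers below are hypothetical stubs; the point is only the strictly sequential ordering):

```python
# Sketch of the current sequential flow. All helpers are stubs; the point
# is that three network round trips happen one after another.
def record_from_microphone() -> bytes: ...   # platform-specific capture
def google_stt(audio: bytes) -> str: ...     # round trip 1: Google STT
def susi_chat_api(text: str) -> str: ...     # round trip 2: Susi server
def watson_tts(reply: str) -> bytes: ...     # round trip 3: Watson TTS
def play(speech: bytes) -> None: ...         # platform-specific playback

def answer_by_voice() -> None:
    audio = record_from_microphone()
    text = google_stt(audio)        # wait for the hypothesis text
    reply = susi_chat_api(text)     # wait for Susi's answer
    speech = watson_tts(reply)      # wait for the synthesized audio
    play(speech)
```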
How does Amazon handle that? Amazon Alexa devices send the recorded voice to Amazon's "Alexa Voice Service", which does all the processing there and sends back a file with the speech output; that file is then played.
Can we incorporate the same in Susi? Yes, we can shift all the speech processing work to the server and follow an Alexa-like approach. This also helps us think in a cross-language way: a web client can record using the Web APIs for the microphone and then play back the file sent by the server. Also, the smart mirror project https://magicmirror.builders/ has modules written primarily in JavaScript, so we get an advantage there too.
It can be done using a servlet on Susi Server dedicated to speech processing, which calls the other APIs.
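A rough sketch of what that servlet's orchestration could look like. Susi Server is written in Java, so the real implementation would be a Java servlet; the Python/Flask handler below, with its hypothetical endpoint and stubbed helper calls, only illustrates the flow:

```python
# Illustrative stand-in only: Susi Server is Java, so the real thing would
# be a servlet. This Flask handler just shows the server-side orchestration.
from flask import Flask, request, Response

app = Flask(__name__)

def google_stt(audio: bytes) -> str: ...   # server-side Google Speech call (stub)
def susi_chat(text: str) -> str: ...       # server-side Susi Chat API call (stub)
def watson_tts(reply: str) -> bytes: ...   # server-side Watson TTS call (stub)

@app.route("/susi/speech", methods=["POST"])   # hypothetical endpoint
def speech():
    audio = request.get_data()    # raw voice recording from the client
    text = google_stt(audio)      # STT over the server's fast link
    reply = susi_chat(text)       # generate the Susi response
    wav = watson_tts(reply)       # synthesize the spoken answer
    return Response(wav, mimetype="audio/wav")  # .wav back to the device
```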
@mariobehling @Orbiter what are your views on this?