Faster Result Generation

lelelelem commented 6 years ago

Just want to ask if there is actually a way to speed up result generation? The --threads flag seems to only limit the number of threads. Is there any other option or configuration that might speed up the process? Thanks man!

DanielSWolf commented 6 years ago

Rhubarb is already optimized for speed. For Thimbleweed Park, it was repeatedly used to animate around 16,000 lines of dialog within a few hours.

What exactly are your requirements?

lelelelem commented 6 years ago

Hi Daniel. We are building a chatbot that has a virtual face. When a user speaks, the chatbot thinks of a reply. That reply is then passed to a text-to-speech API for synthesis, and the resulting audio is passed through Rhubarb so we can get the mouth shapes and use them to animate the chatbot's virtual face.

It's working perfectly, but the problem is that for audio with a duration of 3 seconds, Rhubarb takes around 10 seconds on average to generate the JSON file. This is a problem for us since we want the bot to reply as fast as it can. So if there is any way we can speed up the generation, that would be great. I really appreciate your work, as it was the only possible solution I saw for our project.

DanielSWolf commented 6 years ago

I see. In that case, it may indeed be possible to speed up the process significantly. Let me elaborate.

What Rhubarb does can be divided into two steps. First, it needs to determine which sounds (phones) are being said at what precise points in time. This process takes the vast majority of the processing time you observed. Second, it performs the actual animation. This step is very fast.

Given that you use text to speech (TTS), it may be possible to find a shortcut. Many TTS systems optionally give you metadata, including the exact timing of the words or even phones they generate. If your TTS system supports this, it should be possible to hack Rhubarb to accept the metadata as input instead of the actual sound file. This way, processing should be almost instantaneous.

What TTS system are you using?

lelelelem commented 6 years ago

We’re actually using IBM Watson’s TTS. I also looked into that possible solution, but it looks like they don’t have a way to return phoneme timings (or even word timings). Word timings are available via IBM Watson’s STT, though. Will word timings suffice?

Really appreciate you taking your time to help!

DanielSWolf commented 6 years ago

Accurate word timings should reduce the processing time by at least 50%, if not more. Rhubarb could skip voice activity detection and speech recognition and would only have to do single-word forced alignment, which should be pretty fast.

lelelelem commented 6 years ago

Oh! That will be great! I haven't really dug deep into your code yet. It's already past midnight here, so I'll try to check that out tomorrow.

If it's not too much of a bother, which modules will I need to tinker with?

DanielSWolf commented 6 years ago

Here are my assumptions:

- You want a quick-and-dirty solution.
- Your TTS tool generates a text file that contains the exact start and end timestamp for each word. Also, the words are normalized. That means that the input "John has $2 million" gets turned into ["john", "has", "two", "million", "dollars"], each word with timestamps.
- You pass this special text file to Rhubarb as the dialog file.
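
To make the second assumption concrete, here is a minimal sketch of reading such a word-timing file. The line format (`<start> <end> <word>`, times in seconds), the `TimedWord` type, the `parse_timed_words` helper, and the sample timestamps are all hypothetical; neither Rhubarb nor any particular TTS system is known to use this layout. It is written in Python for brevity, while Rhubarb itself is C++.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip
    text: str     # normalized, lower-case word


def parse_timed_words(content: str) -> list[TimedWord]:
    """Parse lines of the assumed form '<start> <end> <word>'."""
    words = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"line {line_number}: expected '<start> <end> <word>'")
        start, end = float(parts[0]), float(parts[1])
        if end < start:
            raise ValueError(f"line {line_number}: end precedes start")
        words.append(TimedWord(start, end, parts[2].lower()))
    return words


# Made-up timestamps for the normalized example sentence.
sample = """\
0.00 0.31 john
0.31 0.52 has
0.52 0.80 two
0.80 1.25 million
1.25 1.70 dollars
"""
timed_words = parse_timed_words(sample)
print([w.text for w in timed_words])
```

Whatever the real format turns out to be, the point is that each normalized word arrives with its own start and end time, so no speech recognition is needed to find it.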

Under these assumptions, the only function you'll need to change is recognizePhones in /rhubarb/src/recognition/phoneRecognition.cpp. This function gets the audio clip and dialog file and is expected to return a timeline of phones.

Here's the idea:

- Keep the function code up to here; delete the rest
- Create the result timeline like this
- Start off single-threaded with a single decoder.
- Call addMissingDictionaryWords to make sure each word is contained in the decoder's dictionary
- For each word:
  - Determine the word ID via getWordId()
  - Create an audio buffer containing only the single word by calling audioClip.getTruncatedRange(), optionally padding a bit (like this)
  - Call getPhoneAlignment(), passing a vector containing only the single word ID and the single-word audio buffer
  - Copy the aligned phones to the result timeline
- Return the result timeline
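
The control flow of the per-word loop can be sketched as follows. This is a language-neutral illustration in Python, not Rhubarb's C++ code: `padded_range`, `recognize_phones`, and the `align_word` callable are hypothetical stand-ins for the getTruncatedRange(), getWordId(), and getPhoneAlignment() steps, and the sketch assumes the aligner reports phone times relative to the start of the single-word buffer. `fake_align_word` only exists to make the example runnable.

```python
def padded_range(start, end, clip_duration, padding=0.05):
    """Widen [start, end] by `padding` seconds, clamped to the clip bounds."""
    return max(0.0, start - padding), min(clip_duration, end + padding)


def recognize_phones(words, clip_duration, align_word, padding=0.05):
    """Align each word on its own and merge the results into one timeline.

    words:      iterable of (start, end, text), times in seconds
    align_word: hypothetical callable (text, range_start, range_end) that
                returns (phone_start, phone_end, phone) tuples with times
                relative to range_start
    """
    timeline = []
    for start, end, text in words:
        range_start, range_end = padded_range(start, end, clip_duration, padding)
        for phone_start, phone_end, phone in align_word(text, range_start, range_end):
            # Shift buffer-relative times back to clip-relative times.
            timeline.append((range_start + phone_start, range_start + phone_end, phone))
    return timeline


def fake_align_word(text, range_start, range_end):
    """Toy aligner: one 'phone' per letter, spread evenly over the range."""
    step = (range_end - range_start) / len(text)
    return [(i * step, (i + 1) * step, letter) for i, letter in enumerate(text)]


timeline = recognize_phones(
    [(0.0, 0.3, "hi"), (0.5, 0.9, "yo")],
    clip_duration=1.0,
    align_word=fake_align_word,
    padding=0.1,
)
for start, end, phone in timeline:
    print(f"{start:.2f}-{end:.2f} {phone}")
```

Note that padding can make the ranges of adjacent words overlap, so the copy step may need to trim or merge phones at word boundaries.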

Good luck! Let me know how it goes!

lelelelem commented 6 years ago

Thanks for the help! It looks easy enough, and I really appreciate your work! I'll update you once I make some progress.

lelelelem commented 6 years ago

Hi Daniel! We will be switching to Amazon Polly instead, since their TTS also has viseme detection, so we'll use that capability instead. Thanks a lot for the help! Will be closing this issue now.

DanielSWolf / rhubarb-lip-sync

Faster Result Generation #31