DanielSWolf / rhubarb-lip-sync

Rhubarb Lip Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for characters in computer games, in animated cartoons, or in any other project that requires animating mouths based on existing recordings.

Directly outputting Words/timestamp to Mouth Shapes bypassing pocketSphinx #73

Open towfiqi opened 5 years ago

towfiqi commented 5 years ago

Rhubarb is a little slow; I think it's because of the time PocketSphinx takes to recognize the words in the audio file. In my project, I will use Google speech-to-text, which will output something like this:

[
  {
    "startTime": "1.300s",
    "endTime": "1.400s",
    "word": "Four"
  },
  {
    "startTime": "1.400s",
    "endTime": "1.600s",
    "word": "score"
  },
  {
    "startTime": "1.600s",
    "endTime": "1.600s",
    "word": "and"
  },
  {
    "startTime": "1.600s",
    "endTime": "1.900s",
    "word": "twenty"
  }
]
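
For reference, here is how I plan to read those entries (a minimal TypeScript sketch; the interface name is just illustrative):

    // Shape of one word entry from Google STT's word-level time offsets.
    interface SttWord {
      startTime: string; // e.g. "1.300s"
      endTime: string;
      word: string;
    }

    // Google returns durations as strings with a trailing "s";
    // parseFloat ignores the trailing non-numeric characters.
    function toSeconds(t: string): number {
      return parseFloat(t);
    }

    const words: SttWord[] = [
      { startTime: "1.300s", endTime: "1.400s", word: "Four" },
      { startTime: "1.400s", endTime: "1.600s", word: "score" },
    ];
    for (const w of words) {
      console.log(w.word, toSeconds(w.startTime), toSeconds(w.endTime));
    }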

What file/function inside this repo do I look into to see how Rhubarb converts the words to mouth shapes directly? I mean, what function does Rhubarb use to convert words to shapes, e.g.:

0.00    A
0.05    B
0.63    C
0.70    B
0.84    F
0.98    D

I looked into rhubarb-lip-sync/rhubarb/src/core/ and could not figure it out. So far I understand that the words are first converted to the ARPAbet phonetic alphabet and then converted to mouth shapes, e.g. AA becomes shape A, EH becomes shape C, etc.

Can you kindly provide a high level overview of what process rhubarb follows to convert words to shapes?

Thanks

DanielSWolf commented 5 years ago

The best overview you'll find is in /src/lib/rhubarbLib.cpp. The first step is getting time-stamped phones; the second step is animating them.

Google won't give you the timing of individual phones, only words. So you'll have to enter at a lower level of code, somewhere within /src/recognition. You'll have to replace the part recognizing the words, but not the part converting words to phones and aligning them.
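
Conceptually, the handoff looks like this (a TypeScript-style sketch purely for illustration; Rhubarb's actual types and functions are C++ and look different):

    // Illustration only: this just shows where the seam would be.
    interface TimedWord  { start: number; end: number; word: string; }  // from Google STT
    interface TimedPhone { start: number; end: number; phone: string; } // what the animation step consumes

    // Hypothetical pipeline: swap out word recognition, keep the rest.
    function lipSync(words: TimedWord[],
                     alignPhones: (w: TimedWord[]) => TimedPhone[],
                     animate: (p: TimedPhone[]) => void): void {
      animate(alignPhones(words)); // words -> aligned phones -> mouth shapes
    }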

One problem I foresee is that Google STT won't reliably give you individual words. Try recognizing "$142". PocketSphinx will give you words: "one hundred and forty two dollars". Last time I tried it, Google attempted to be clever and returned "$142", which can't be mapped back to the individual spoken words.

Unless Google changed their output, your approach will likely fail every time numbers, currencies, or dates are involved. If this is no longer the case, please let me know. I'd love to integrate Google STT with Rhubarb.

towfiqi commented 5 years ago

The dollar issue can be mitigated by converting the $ to the string "dollar" and the numbers to words with something like this: https://www.npmjs.com/package/written-number when handling the Google STT output.
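
Something along these lines (a rough sketch; written-number is the package linked above, and the regex only covers simple integer amounts):

    // Rough sketch: expand "$142" into spoken words using written-number.
    // Assumes esModuleInterop for the default import.
    import writtenNumber from "written-number";

    function expandDollars(text: string): string {
      return text.replace(/\$(\d+)/g, (_, n) => `${writtenNumber(Number(n))} dollars`);
    }

    console.log(expandDollars("$142")); // "one hundred and forty-two dollars"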

As for the timestamps for each phone, I was planning to divide the duration by the number of shapes. This is how I plan to complete the whole process:

First, get the transcription from Google/IBM speech-to-text. For example, the word "Four" is uttered in 0.1 seconds, and Google STT's output is:

{ "startTime": "1.300s", "endTime": "1.400s", "word": "Four" }

Next, find the ARPAbet for "Four" in cmudict.0.7a, which is: F AO R.

Next, convert F AO R to X G E H X (5 shapes). This can be done easily, since each shape correlates to only a few phones.

Then divide the duration (0.10) by 5 and give each shape 0.02 seconds, like this:

0.00    X
0.02    G
0.04    E
0.06    H
0.08    X
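
Putting it all together, roughly (the mini-dictionary and the phone-to-shape pairs below are just stand-ins for illustration, not Rhubarb's actual mapping):

    // Sketch of the even-split idea. The mini-dictionary and the
    // phone-to-shape pairs are stand-ins, not Rhubarb's real mapping.
    const cmudict: Record<string, string[]> = {
      FOUR: ["F", "AO", "R"], // from cmudict.0.7a
    };
    const phoneToShape: Record<string, string> = {
      F: "G", AO: "E", R: "H",
    };

    function wordToShapes(word: string, start: number, end: number): [number, string][] {
      const phones = cmudict[word.toUpperCase()] ?? [];
      // Rest shape (X) before and after the spoken phones.
      const shapes = ["X", ...phones.map(p => phoneToShape[p] ?? "X"), "X"];
      const step = (end - start) / shapes.length;
      return shapes.map((s, i): [number, string] => [start + i * step, s]);
    }

    // "Four" from 1.3s to 1.4s -> 5 shapes, 0.02s each: X G E H X
    console.log(wordToShapes("Four", 1.3, 1.4));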

I know the animation won't be perfect, but since 95% of the characters built in Spine are 2D-ish, it won't be that noticeable.

If it works, the great thing about this would be that it can easily be used on the fly, which my project depends on.

Let me know what you think. Thanks

DanielSWolf commented 5 years ago

For simple cases, this should work. Beware, however, that there are many special cases that won't be covered.

At the end of the day, it all depends on your requirements. The more control you have over the aspects I mentioned, the less of a problem they may be.

towfiqi commented 5 years ago

I will keep that in mind. Can you kindly look at the list below and see if I have mapped the mouth shapes correctly:

    /*A*/ ["P", "B", "M"],
    /*B*/ ["K", "S", "T", "EE", "IY", "IH"],
    /*C*/ ["EH", "AE", "AH", "Schwa", "EY", "AY", "HH", "G", "CH", "JH", "R", "Y"],
    /*D*/ ["AA"],
    /*E*/ ["AO", "ER", "SH", "ZH"],
    /*F*/ ["UW", "OW", "W", "UH", "AW", "OY"],
    /*G*/ ["F", "V"],
    /*H*/ ["L", "N", "NG", "T", "D", "TH", "DH", "S", "Z", "D"],

Thank You

DanielSWolf commented 5 years ago

Rhubarb's animation algorithm is more complex than a simple lookup. For a time-tested lookup table, I recommend you have a look at Papagayo or Papagayo-NG.

towfiqi commented 5 years ago

I will look into it. Thanks for all your help! :smiley:

lukas-mertens commented 5 years ago

@towfiqi Did you make any progress on this? I would be very interested in this as well, because it could be a great way to make Rhubarb work with different languages (see #5).

towfiqi commented 5 years ago

@lukas-mertens The simple array lookup was sufficient for my project, so I am using the method I described above. For a different language, just swap the cmudict.0.7a English dictionary with another language's dictionary: https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/fst/

lukas-mertens commented 5 years ago

I am actually doing that already. But it doesn't work that well, because the precision is rather bad: sometimes the mouth moves when nothing is spoken, and the other way around. Did you get better results by using Google speech-to-text? I tried out the API and got very good speech recognition results with my files (only about one wrong word every few sentences). That's why I was thinking about using Google's API to get precise timing for every word. Additionally, I found out that espeak supports converting text to IPA:

cat script.txt | espeak -q -v de --ipa > phonetics.txt

If I had a lookup table from IPA to Rhubarb's mouth shapes, I believe I could get quite precise results. @DanielSWolf I don't know how the integrated phonetics recognizer works, but could this be combined? Maybe you could use espeak to convert text to IPA for many languages and use Google to at least make the words match up with when the mouth moves.
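
For illustration, something like this is what I have in mind (the IPA-to-shape pairs below are my own guesses, not Rhubarb's actual mapping):

    // Sketch: map espeak's IPA output to mouth shapes. The pairs below
    // are my own guesses for illustration, not Rhubarb's mapping.
    const ipaToShape: Record<string, string> = {
      "p": "A", "b": "A", "m": "A", // closed lips
      "f": "G", "v": "G",           // teeth on lower lip
      "uː": "F", "oː": "F",         // rounded lips
      "aː": "D",                    // wide open
    };

    function ipaToShapes(ipa: string): string[] {
      // Longest-match-first scan; skips stress marks and unknown symbols.
      const keys = Object.keys(ipaToShape).sort((a, b) => b.length - a.length);
      const shapes: string[] = [];
      let i = 0;
      while (i < ipa.length) {
        const key = keys.find(k => ipa.startsWith(k, i));
        if (key) { shapes.push(ipaToShape[key]); i += key.length; }
        else i++;
      }
      return shapes;
    }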

towfiqi commented 5 years ago

You can try IBM Watson speech-to-text; you get an exact timestamp for each word, which Google speech-to-text doesn't provide.

Also, Rhubarb uses PocketSphinx, which is not that great. It may produce good results if the audio quality is good and the pronunciation is clear.

For my project, the requirement is actually text-to-mouth-shape animation, which is why I don't have to convert audio to text first; I already have the text.

If you could share your method of converting text to mouth shapes, that would be great! In my attempt I got lots of closed mouths between words, which made the animation look bad.

Edit: I just checked your link. Looks like Google STT does output timestamps, which I missed when I was researching.