DanielSWolf / rhubarb-lip-sync

Rhubarb Lip Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for characters in computer games, in animated cartoons, or in any other project that requires animating mouths based on existing recordings.

Phonetic recognition #45

Closed. DanielSWolf closed this issue 5 years ago.

DanielSWolf commented 5 years ago

Rhubarb Lip Sync uses word-based speech recognition. That works well for English dialog. For non-English dialog, however, phonetic recognition might work better: rather than trying to extract English words from non-English speech, it extracts phonemes.

I'm planning to add a CLI option to switch to phonetic recognition.

This is only a temporary solution. In the long run, I still plan to implement full (word-based) recognition for languages other than English (see #5).

DanielSWolf commented 5 years ago

I created a feature branch (feature/phonetic-recognition) to get some feedback. Here are the instructions I wrote on the Thimbleweed Park forum:

The basic idea of my hack is this: Normally, Rhubarb tries to recognize whole words (and phrases). Since Rhubarb only knows English, it has a hard time finding English words and phrases that resemble the Italian dialog. That's why the results are often rather inaccurate.

My hack simply lowers the granularity. Instead of looking for whole English words and phrases, it now looks for English phonemes and syllables. So the underlying language model is still English, but the chances that a given Italian phone is similar to an existing English phone are pretty good. And the chances that a given Italian syllable has a matching English syllable are still not bad.

There is, however, still some fine-tuning to be done. If Rhubarb only worked at the syllable level, there would still be many Italian syllables it couldn't match. As a result, the animation would look wrong in those places. Worse, the mouth could even stop moving for a moment if Rhubarb really couldn't find a suitable match.

The obvious solution would be not to work at the syllable level, but only at the phone level. Most Italian phones are also present in the English language. The problem here is fluttering. If the voice actor is saying a long phone that's exactly between two known English phones, Rhubarb might first recognize phone A, then phone B, then A again and so on, while actually the speaker is still saying the same sound. As a result, the animated mouth might flutter between several shapes during a single phone, which looks quite bad.

The solution, then, is to blend the two approaches. I've temporarily added an additional (mandatory) command-line argument modelWeight. If you specify a high value (such as 2.0), Rhubarb will try to recognize whole syllables, leading to imprecise or freezing animation. If you specify a low value (such as 0.1), Rhubarb will try to recognize individual phones, leading to fluttering.

I found that the value 1.0 seems to work well, balancing the advantages of both approaches. But I didn't try any other values between the two extremes. So maybe something like 0.8 or 1.3 could work even better. Also, I only tried the new approach with a short one-minute dialog containing Italian, Spanish, French, and German. Trying it out on a larger body of recordings may give additional insights.

My plan is to settle for a fixed model weight before the release. Then I'll add a new command-line option to switch between the original, word-based recognition (which looks best for English) and the phonetic recognition (which will hopefully work better for non-English dialog).

Let me know what you think! I'm grateful for any feedback. And if you found a modelWeight value that seems to work better than 1.0, let me know.

Note that this feature branch may change at any time.

morevnaproject commented 5 years ago

Thank you very much!

nshmyrev commented 5 years ago

You should try something like montreal-forced-aligner; it supports many languages out of the box.

DanielSWolf commented 5 years ago

@nshmyrev Thanks for the reference; that project looks interesting.

However, this issue is about phonetic recognition, while Montreal Forced Aligner is about forced alignment and G2P. Am I missing a connection?

DanielSWolf commented 5 years ago

I've extracted all the speech recognition logic into an interface called Recognizer, so that recognizers can be selected via CLI. To start, I implemented two recognizers: pocketSphinx is the old English recognizer; phonetic is the new, language-agnostic recognizer.