DanielSWolf / rhubarb-lip-sync

Rhubarb Lip Sync is a command-line tool that automatically creates 2D mouth animation from voice recordings. You can use it for characters in computer games, in animated cartoons, or in any other project that requires animating mouths based on existing recordings.

Real-time lip sync for Cosplay #22

marcwolf1960 closed this issue 6 years ago

marcwolf1960 commented 6 years ago

Hi. I found your Github with the Rhubarb Lip Sync app on it and I was wondering if you could give me some advice.

I do animatronics for cosplay and other amateur/hobby applications. One thing I have been working on for a long time is a way to take a continuous real-time speech stream from a microphone and generate a number corresponding to the lip shape. From there, a secondary processor can take this number and manipulate servos to give an approximation of the speech on an animatronic character.

For instance, when a movie features a werewolf whose main character is an actor in a suit, a team of puppeteers shapes the lips and facial expressions by remote control. All of this has to be carefully scripted or added afterwards as CGI.

However, take the same character and put them in a live performance situation, i.e. a cosplay convention, and you do not have that flexibility. I already have ways to pick up the actor's facial movements underneath the costume, but building a small, self-contained real-time lip sync system is beyond me.

Could Rhubarb be compiled under Mono to run on an RPi or similar? The output would be just a stream of numbers covering the lip shapes; the input would be a microphone worn by the actor.

Any suggestions would be greatly appreciated.

Many thanks Dave

DanielSWolf commented 6 years ago

Hi Dave,

Live lip sync for a Cosplay costume certainly sounds interesting! :-)

The problem

Let me start by saying that Rhubarb Lip Sync cannot do what you need. Rhubarb is not a real-time application; it always works on "utterances" (segments typically a few seconds long). To work in real time, one would have to rewrite the entire application.

In addition, even if you were to write a new application from scratch, I'm fairly certain that its results wouldn't look convincing. No matter how good your application is, the mouth animation will always appear late and out of sync.

Let me elaborate on this. Speech consists of phones (that is, basic sounds). For instance, the word "glue" consists of the three phones G, L, OO. People often assume that each phone has a single corresponding mouth shape. That isn't the case. What really happens most of the time is that the lips "anticipate" the next vowel, that is, they move early. Watch yourself in a mirror saying "glue". (Important: say it quickly, as in regular conversation; don't over-enunciate!) You'll notice that your lips immediately form the OO shape while you're still saying the G phone.

Your brain knows what the next vowel will be, so your lips will form it ahead of time. A program doing real-time lip sync cannot possibly know what vowel will follow. The word "glee", for instance, also starts with G and L. There is no audible difference in these consonants, but the mouth would have to form a completely different shape.

Here's a different form of the problem. Take the word "apparently". Obviously, the mouth has to be open for the first vowel, then closed for the P phone, then open again for the second vowel. The thing is that P is a so-called plosive consonant: the sound isn't made with the mouth closed (like an M, for instance); it is generated the moment the lips open. So to make a P sound, your mouth first has to close, then open, and only then do you hear the sound. That means that any program performing real-time lip sync based on what it hears will be much too late. By the time it hears the P sound, the mouth is supposed to have closed and opened again; closing it now will look severely out of sync.

There are more temporal factors working against you as well: capturing and processing the audio takes time, and the servos themselves need time to move into position.

Solution 1: Optimization

Now that I've explained the problem, let me suggest two options that might work. First, if you are determined to use a microphone-based approach, what you need is real-time phone recognition. Googling this term will turn up a number of scientific papers on the matter. It certainly seems possible to get it working with minimal delay; see this video for a demonstration.
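To give you a feel for what that could look like, here is a rough sketch of streaming phone recognition using the pre-5.0 PocketSphinx C API in "allphone" mode (Rhubarb itself builds on PocketSphinx, though not in this streaming fashion). The model paths and the `arecord` pipeline are assumptions; adjust them for your installation.

```cpp
// Hypothetical streaming phone recognizer built on the PocketSphinx
// pre-5.0 C API in allphone mode. Reads raw 16 kHz mono 16-bit PCM
// from stdin, e.g.:
//   arecord -f S16_LE -r 16000 -c 1 -t raw | ./phones
// The model paths are placeholders for a default installation.
#include <pocketsphinx.h>
#include <cstdint>
#include <cstdio>

int main() {
    cmd_ln_t* config = cmd_ln_init(nullptr, ps_args(), TRUE,
        "-hmm", "model/en-us/en-us",                   // acoustic model
        "-allphone", "model/en-us/en-us-phone.lm.bin", // phoneme LM
        nullptr);
    ps_decoder_t* ps = ps_init(config);
    if (!ps) return 1;

    int16_t buffer[512]; // ~32 ms of audio per chunk at 16 kHz
    ps_start_utt(ps);
    size_t n;
    while ((n = fread(buffer, sizeof buffer[0], 512, stdin)) > 0) {
        ps_process_raw(ps, buffer, n, FALSE, FALSE);
        // The partial hypothesis is a growing string of phone names;
        // its last token is the most recently recognized phone.
        int32 score;
        const char* hyp = ps_get_hyp(ps, &score);
        if (hyp) { printf("\r%s", hyp); fflush(stdout); }
    }
    ps_end_utt(ps);
    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}
```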

Once you receive a phone, you cannot afford to wait for additional phones in order to make more informed decisions. So you'll probably get best results with a simple lookup table that assigns one mouth shape to each phone. For inspiration, you can check out the source code of Rhubarb 0.2.0. This very early version still used such a simple lookup table. The mouth shapes A through H referenced in the code are documented in the current README file.
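To illustrate, a minimal sketch of such a lookup table might look like this. The ARPAbet-style phone names and the specific assignments below are a rough guess for illustration, based on the A-H shape descriptions in the README, not the exact mapping from Rhubarb 0.2.0.

```cpp
// Minimal sketch of a phone-to-mouth-shape lookup table. The shapes
// A-H follow the descriptions in the Rhubarb README; the assignments
// are illustrative, not the exact Rhubarb 0.2.0 mapping.
#include <iostream>
#include <string>
#include <unordered_map>

char mouthShapeFor(const std::string& phone) {
    static const std::unordered_map<std::string, char> table = {
        {"P", 'A'}, {"B", 'A'}, {"M", 'A'},   // closed lips
        {"K", 'B'}, {"S", 'B'}, {"T", 'B'},   // slightly open, clenched teeth
        {"EH", 'C'}, {"AE", 'C'},             // open mouth
        {"AA", 'D'},                          // wide open mouth
        {"AO", 'E'}, {"ER", 'E'},             // slightly rounded lips
        {"UW", 'F'}, {"OW", 'F'}, {"W", 'F'}, // puckered lips
        {"F", 'G'}, {"V", 'G'},               // upper teeth on lower lip
        {"L", 'H'},                           // tongue raised behind teeth
    };
    auto it = table.find(phone);
    return it != table.end() ? it->second : 'B'; // neutral fallback
}

int main() {
    // "glue": G -> L -> UW
    for (const char* phone : {"G", "L", "UW"}) {
        std::cout << phone << " -> " << mouthShapeFor(phone) << '\n';
    }
}
```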

Solution 2: Monitoring the mouth

As we've seen, the mouth is always faster than audible speech. So I believe that a better approach would be monitoring the mouth. If there is enough room in the mask, you might be able to implement a camera-based system. First, apply some marker points to the mouth to simplify tracking. You only need the vertical and horizontal opening, so four markers may suffice. Then, use a macro camera to film the mouth at close distance, and a library like OpenCV for feature tracking. (Here is a video showing that the Raspberry Pi camera can easily be turned into a macro camera.)
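As a starting point, the tracking loop could be as simple as the following sketch. The HSV threshold values are placeholders to tune for your marker color and lighting; everything else uses standard OpenCV calls.

```cpp
// Hypothetical OpenCV sketch: threshold four colored mouth markers and
// reduce them to two numbers (vertical and horizontal opening).
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    cv::VideoCapture cap(0); // macro camera pointed at the mouth
    cv::Mat frame, hsv, mask;
    while (cap.read(frame)) {
        cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
        // Placeholder HSV range for green markers; tune for your setup.
        cv::inRange(hsv, cv::Scalar(40, 100, 100), cv::Scalar(80, 255, 255), mask);

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        if (contours.size() < 4) continue; // need all four markers in view

        // Marker centroids via image moments.
        std::vector<cv::Point2f> centers;
        for (const auto& c : contours) {
            cv::Moments m = cv::moments(c);
            if (m.m00 > 0) centers.emplace_back(m.m10 / m.m00, m.m01 / m.m00);
        }

        // The bounding box of the markers gives the mouth opening.
        cv::Rect box = cv::boundingRect(centers);
        printf("vertical=%d horizontal=%d\n", box.height, box.width);
        // Forward these two values to the servo controller here.
    }
    return 0;
}
```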

If the mask is large enough that you can position the camera a few centimeters from the mouth, this approach might work quite nicely. As an alternative, you could use some kind of mechanical sensors. I have no experience with this, but I'm confident there are sensors that let you reliably track the movements of four points on the cosplayer's face.

marcwolf1960 commented 6 years ago

Hi Daniel. Many, many thanks for your valued comments. I am well aware of many of the issues with phoneme processing, so I have been keeping my fingers crossed for both faster, smaller processors and better predictive algorithms. I am really only aiming for very approximate movements like pursing the lips, pressing the lips together, and making a circular motion.

Yes, servos are slower, and I realize the lag issue. I am hoping that the actor could speak more slowly to give the servos a chance to keep up. This could be rehearsed by practicing in a mirror.

The main advantage of a phoneme-based system is that it can be made portable and is not tied to any particular user or costume. So whether an animalistic costume has a short muzzle like a cat or a long muzzle like a wolf, both will work.

Monitoring the mouth: thanks for the links on modding the Pi camera. The camera can also be unseated from the board and hung just by its ribbon cable. Again, it comes down to how much space and how much of a view of the mouth one has to work with. I have tried some experiments using fluorescent markers illuminated by UV LEDs, but ran into issues with the size of the mouth. When fully open, the mouth can be 2″ × 2″, and finding a macro lens that can cover that area at a distance of 1.5″ is tricky; I do not know who to contact on that one. I also experimented with fisheye lenses, but the distortion at the edges became too much. It's not as easy as James Cameron's Avatar, where they have a lot more room and a clear view of the face.

Another approach I tried was to affix small IR LEDs to the lips, but finding a glue that was flexible enough and could be applied quickly was not easy. I even looked at silicone lip shields: https://www.youtube.com/watch?v=GUZFKgFJXR0. Although you can only take 4 readings at a time, one can have 8 or more LEDs and sequence them, 4 on and 4 off, etc.

Additional types of sensors are a possibility. I have designed my own low-cost, low-force, low-profile linear sensors that are very fast and responsive; I can get a resolution of under 0.2 mm from them.

Again, many thanks for your sage advice. :)

DanielSWolf commented 6 years ago

I'm glad I could help. I'm closing this issue for now -- let me know if there is anything more I can help you with.

And please give me an update when you make progress! I'm finding this topic very interesting, and I'm curious to know what solution you'll find!