Technology demonstrator - automatic lipsync system

hatanasinclaire commented 2 years ago

example gif

Overview

The following is a description for an automated system designed specifically to automatically determines a sequence of face shapes ("visemes") from Monika's dialogue. The idea behind this system is that the face shapes do not have to be specified manually for each line of dialogue, and therefore does not require changing tens of thousands of lines of existing dialogue code.

This system performs a series of steps to convert text to facial shapes:

The line is regexed to remove things like timing information. Numbers are converted to text using num2words and some punctuation is stripped out.
The regexed line is converted into its English pronunciation. Pronunciation in English is often represented using the International Phonetic Alphabet (IPA), a standardized system used to describe sounds in human speech. The system uses eng-to-ipa's convert() function to convert the text to the IPA representation of how it is pronounced.
Because letters do not correspond directly to sounds in the English language, eng-to-ipa cannot catch the pronunciation of every word. A secondary function is used to manually specify pronunciation for some of the words that eng-to-ipa does not recognize. Failing this, the word is skipped if no pronunciation is available.
The IPA pronunciation of the line is converted into the appropriate facial shapes using a lookup table. Though there are around 40-50 different sounds ("phonemes") commonly used in English, there are only half as many different facial shapes ("visemes") as some sounds will make the same shape. For example, the mouth makes the same shape when pronouncing cat and cut.

I identify nineteen distinct mouth shapes corresponding to fifty phonemes. Brand new mouth sprites for these are included.

Linked below is a hastily thrown together demo that utilizes pygame to preview how this looks with an arbitrary input. Please try it out!

Next Steps

This system currently has no way to connect directly to MAS. This is where I will need some developer help.

The first step the system needs to take for each line of dialogue is to get the dialogue actually displayed. Booplicate points out that mas_core._last_text can be used to find the current line.

The final step the system needs to perform is sending the sequence of visemes to Monika's sprite object then displaying them on screen. This part will require more substantial addition to the sprite system, and will likely constitute much of the remaining work to be done. To any devs interested in incorporating this into Monika After Story, I am available for further discussion regarding this here or on Discord.

Package Requirements

re - which of course is already prevalent in Monika After Story.
eng_to_ipa
num2words

Limitations

The system only works for the English language. Extending it to other languages would require both finding a Python library to convert that language to IPA as well as a new lookup table to convert phonemes to visemes.
Depending on how mas_core._last_text works, variables may or may not be stripped out of the text. But I am not too familiar with exactly what it outputs.

Code

The code, along with the sprites and a demo program can be found at this repository.

External Links

eng-to-ipa num2words

The phoneme-to-viseme conversion is informed by existing viseme systems but ultimately is unique to this prototype. Face the FACS Oculus Viseme Reference Microsoft Azure Cognitive Services

Booplicate commented 2 years ago

I think this is cool, but we would need to know what licenses eng_to_ipa and num2words use, and what version of Python they require.

I can't find any info on the eng_to_ipa page, it was made in 2020, so probably some kind of Python 3, but no license. num2words seems to be Python 3 only, but also uses LGPL-2.1, which is not ideal because we use a different license (but it might be okay).

If this is Python 3 only, it's fine, we will soon migrate.

hatanasinclaire commented 2 years ago

eng-to-ipa says it has an MIT license on its Github repo, but I could not find any information on version compatibility. I was able to get the demo to run in a Miniconda Python environment running Python 3.9.12 (not sure if this is helpful), however.

num2words can be replaced with inflect which has an MIT license and requires Python 3.7 or higher, so it will work after the migration. I will work on testing inflect and replacing num2words with it.

hatanasinclaire commented 2 years ago

It seems to me to me that the implementation of this will require a new class MASMoniTalkTransform() to be described in sprite-chart.rpy, similar to the existing class MASMoniBlinkTransform() and wink transform. This will be called in sprite-generation.rpy within generate_normal_sprite(). Does this sound correct?

ThePotatoGuy commented 2 years ago

Aside from lib licensing, hows the performance of this? Probably should wait until r8 at minimum but rapid sprite rendering might be too much.

This also should be optional/we would not be replacing the task of picking sprite code expressions, as the mouth is only 1 part of the expression, and we need static mouth selections for songs, piano, idle, etc...

Booplicate commented 2 years ago

We need to find a way to use dynamic and static mouths at the same time. It doesn't make much sense to always stay with open mouth now that it'd be animated, at the same time, in some cases we want it to be static or keep it open.

So it's something we should brainstorm.

Freezing the mouth would be rather rare, so it's something we'd need to enable explicitly
Maybe use a prefix for mouth code to make it static?
Maybe new mouth codes for dynamically animated sprites and keep the old codes for static?

Talking about songs, we need a way to change the speed of her speech.

hatanasinclaire commented 2 years ago

Aside from lib licensing, hows the performance of this? Probably should wait until r8 at minimum but rapid sprite rendering might be too much.

This also should be optional/we would not be replacing the task of picking sprite code expressions, as the mouth is only 1 part of the expression, and we need static mouth selections for songs, piano, idle, etc...

The mouth needs to update around 10 times a second to give the appearance of a normal talking speed. I believe this is slower than the blinking animation, which appears to have transitions of less than 0.1 seconds. However, I have not tested an in-game implementation yet.

This system does not replace the task of sprite code expressions but rather specifying the mouth manually is still necessary. The animation shows the mouth specified in the exp code after the mouth flaps are done. There are many instances where the mouth ought to stay open as how the exp code specifies. such as when she is making a big smile or is surprised. One potential area for expansion (thanks to Booplicate for this idea) is having another set of visemes with a wider open mouth when she is speaking and angry or surprised (such as exp code mouths b, o, w, x), and another set with a less open mouth for when she is speaking in a low voice (exp code has mouth c, d, or t)

ThePotatoGuy commented 2 years ago

ok, but in terms of rendering this would have to bypass spritecode system entirely because facial expressions are baked together when rendered, much faster to just have the other facial parts continue baking and this mouth be separate. so the way this defines the additional mouth sprites for this does not have to follow the same pattern as the current system unless the intention is to be usable as static sprites as well.

for control specifics, enable/disable functions are a must, but for inline options, maybe another keyword like m 1eua static or custom text tags. actually, having custom text tags control the speed and when its enabled is probably the most configurable option.

Booplicate commented 2 years ago

Side note, since it's relevant, with r8 I'm planning to overhaul Monika's displayable/draw function. It'll be much more effective as its own displayable. That means getting rid of baking her parts together and caching them ourselves, getting rid of current dynamic displayable. Along with big optimisation it'll give us ability to use shaders and any kind of displayables for Monika's parts (including acs, outfits, etc).

So each part of her can and probably will be some kind of sub-displayable, so using a special displayable for mouth would make sense, especially for something like this.

so the way this defines the additional mouth sprites for this does not have to follow the same pattern

if you mean only definition, then fine, but the syntax for usage should be close to what we already have. Animated lips will be used in most places.

for control specifics, enable/disable functions are a must

So I'd like to avoid syntax like

m 1eua "Hello!"
$ mas_enable_anim_lips()
m 1eub "World!"

it would be very inconvenient to use, even rarely.

maybe another keyword like m 1eua static or custom text tags

Maybe text tags to force static, yeah. But unsure about m 1eua static, since m 1eua_static is already a thing and it makes her eyes static. We can't use one keyword to control both eyes and mouth as both can be used independently. I thought about adding s to the mouth code, like b is the animated form of the open mouth with smile, but sb is the static form. But iirc we don't like sprite codes of ambiguous length.

having custom text tags control the speed and when its enabled is probably the most configurable option.

Not this one though, speed should be taken from the current persistent cps speed.

hatanasinclaire commented 2 years ago

Regarding the licensing - I tested out inflect and it works just fine as a replacement for num2words, though of course it will only work after the Python 3 migration. Is the MIT License for inflect and eng-to-ipa acceptable?

ok, but in terms of rendering this would have to bypass spritecode system entirely because facial expressions are baked together when rendered, much faster to just have the other facial parts continue baking and this mouth be separate. so the way this defines the additional mouth sprites for this does not have to follow the same pattern as the current system unless the intention is to be usable as static sprites as well.

The 19 new mouth shapes are not intended to be callable by exp codes / usable as static sprites so they can be assigned their own system.

maybe another keyword like m 1eua static or custom text tags

Maybe text tags to force static, yeah. But unsure about m 1eua static, since m 1eua_static is already a thing and it makes her eyes static. We can't use one keyword to control both eyes and mouth as both can be used independently. I thought about adding s to the mouth code, like b is the animated form of the open mouth with smile, but sb is the static form. But iirc we don't like sprite codes of ambiguous length.

Would it make sense to have a system like:

m 1eua_static makes only her eyes static
m 1eua_nolipsync makes only her mouth static
m 1eua_static_nolipsync makes both her eyes and mouth static?

Wingdinggaster656 commented 2 years ago

Just by the way, I find the current mouth animation a little scary. Or maybe it's because Monika takes up most of the image in the Gif. Maybe it will look a lot better in the actual game.

ThePotatoGuy commented 2 years ago

responses

Not this one though, speed should be taken from the current persistent cps speed.

using cps text tags means this could vary speed mid sentence - probably more realistic than Monika talking at a uniform speed.

if you mean only definition, then fine, but the syntax for usage should be close to what we already have.

my interpretation of what was previously said was that this wouldn't involve changing existing lines - so all current mouth expressions would just notify this system what set of non-sprite-code mouth sprites to use. aka no need to define new mouth sprite code strictly under the current system.

(my point is that we dont have space for 19 more mouth codes if they aren't going to be used in the same way - and it seems hatansinclaire is in agreement)

Animated lips will be used in most places...it would be very inconvenient to use, even rarely.

generally a good idea to make a toggle for features that significantly change the existing behavior - not everyone is going to want or like lipflaps.

overhaul Monika's displayable/draw function

talk to me before you do this

on static

that's true, we do already have *_static sprites, so how about some other less wordy options:

sm 1eua - make new character object sm for "Static Monika". would mean manual show/hiding unless we did something fancy with the disps.
m _1eua - can pick a letter/letters to add to the start of a sprite code - avoids the ambiguity issue.
m 1eulsa - add a new sprite code for static lips ls - lip static. mouth just needs to be last, middle codes just cannot conflict with the start of other middle codes (blush/sweat/tears/emote)

text tags still on the table as well.

licensing

MIT is fine

new Qs

whats the size of all libs + potential sprites (estimate is fine)?

Booplicate commented 2 years ago

using cps text tags means this could vary speed mid sentence - probably more realistic than Monika talking at a uniform speed.

Sorry, I meant account for both persistent setting and cps text tag. Because that's how renpy displays text and we should sync with it. The faster the text speed, the fast lips move and vice versa.

my interpretation of what was previously said was that this wouldn't involve changing existing lines

That's what I though too. But I think if required, it'd be fine too. Preferable way is if wouldn't need to update anything, of course.

my point is that we dont have space for 19 more mouth codes if they aren't going to be used in the same way

Misunderstood, yeah I doubt we'd need each of those as a static version. We can use existing mouth sprites for that.

generally a good idea to make a toggle for features that significantly change the existing behavior - not everyone is going to want or like lipflaps.

That I don't get, we don't have a toggle for blinking? Lips would be become just another core feature for Monika sprite. If you want, we can add a new setting, but sounds odd and pointless to me. But that's not what I meant, by that example I meant using a function to switch between static/anim would be inconvenient for people that write dialogues. So we should use a text tag/spite code

sm 1eua

At first I liked it, but how would we use only static eyes or only static mouth with this?

m _1eua

This I like less syntax-wise, but it has the same issue as above.

m 1eulsa

That could work, I think. Although, I liked what @hatanasinclaire suggested more.

hatanasinclaire commented 2 years ago

whats the size of all libs + potential sprites (estimate is fine)?

Each mouth .png is around 9-10 KB, about the same as the existing in-game mouth sprites. There are 19 upright and 19 leaning so together they are around 355 KB.

component	size
`eng-to-ipa 0.0.2`	8.76 MB
`inflect 6.0.0`	291 KB
19 upright mouth sprites (in-game 1280 x 850)	184 KB
19 leaning mouth sprites (in-game 1280 x 850)	171 KB
Total (without code)	\~ 9.41 MB

The high resolution mouth sprites at 2560 x 1700 are around 987 KB.

visemes_sprites.zip

howltrek commented 2 years ago

the idea is very good, but the neutrality of his expression is a bit creepy

ThePotatoGuy commented 2 years ago

we don't have a toggle for blinking?

blinking is less busy, subtle, and it doesnt happen constantly. lipflaps flip through sprites quickly, draw attention to themselves, and would happen for every line of dialogue. these are not comparable.

Booplicate commented 2 years ago

blinking is less busy, subtle, and it doesnt happen constantly. lipflaps flip through sprites quickly, draw attention to themselves

Blinking does happen constantly, sometimes multiple blinks in a row, there is some difference, but generally it's the same kind. It'd be like disabling blinking, poses, following eyes (altho, that one is the closest to make sense being togglable). I don't remember a game that would allow to toggle off facial animation.

To make it not look off, instead of just making people disabling it, we should make it look better. As I said before, we'd need a few sets of sprites, for wide and narrow mouth, for smile and smug. The current one is indeed just neutral. I just remembered, we do have a set of sprites! Commissioned a few months ago.

We also don't need that many frames, we can use dissolve for a transition between frames (like I did with the eyes).

hatanasinclaire commented 2 years ago

If the sprites I provided are not good I am very sorry... If desired, I can remake them or produce additional expression sets, but I will not insist on them being used if a better set already exists.

hatanasinclaire commented 2 years ago

After taking a closer look at the viseme shapes I do concede there is something rather unsettling about them. I remade the sprites, changing two major things:

She is visibly smiling more. This means that there will have to be additional sets of visemes for narrow, wide, and possibly smug as well like Booplicate mentioned, but that's fine and I can produce those if given a little time.
The lower teeth are now hidden. Usually anime mouths do not show the mandibular teeth when speaking, In DDLC the only lower teeth visible in the entire game are on Natsuki's and Yuri's shocked, clenched, and grinning sprites. Sayori's and Monika's sprites do not ever show their lower teeth. The effect of this is that showing the lower teeth produces a very unsettling effect. It is especially egregious on the old 3 and 8 visemes but you can see how in general the old shapes which show the lower teeth don't look great.

I hope these adjustments greatly diminish the creepiness of the visual. Please let me know what you think.

lipsync5

Wingdinggaster656 commented 2 years ago

It does look better, but I think it's still a little scary.

The reason could be the wide opening of the mouth, but I'm not sure.

hatanasinclaire commented 2 years ago

It may also be related to how the eyes do not blink and the eyebrows do not move at all on this demonstrator. Try covering them to see how strong the effect is.

Booplicate commented 1 year ago

I like it. People are just not used to this yet, some found blinking weird at first too because they got used to static images.

Smile is good, but I think we should still have neutral (for serious/sad convos) and other variants
Monika shouldn't stay with the open mouth in the end anymore, it looks weird now
We should have control over the speed, it should be synced with the text speed

Just as a reference, this is what we got some time ago

big gif

![monikaLove](https://user-images.githubusercontent.com/53382877/190672054-514a2a82-f152-4e10-be2e-5df5309fc92a.gif)

ThePotatoGuy commented 1 year ago

Blinking does happen constantly,

its constant, not happening constantly. She's not blinking every 100ms like the lipflap changes. The other things you mentioned (poses, eye tracking) are also slow transition changes or subtle.

hatanasinclaire commented 1 year ago

Smile is good, but I think we should still have neutral (for serious/sad convos) and other variants

I'll make additional sets of mouth shapes for the rest of the expressions if I get the go-ahead.

We should have control over the speed, it should be synced with the text speed

This is not trivial, but should be possible as long as there's a way to get the current "cps" parameter. For most lines, I think it's preferences.text_cps? For lines that manually specify {cps=}, perhaps there is a way to get that. Now, in English, letters don't necessary correspond one-to-one directly to phonemes - "th" and "sh" are examples of two letter digraphs that produce one sound. Visemes don't necessarily correspond directly to phonemes either: "j" and the "a" in "ate" actually require two phonemes. But with a little string handling it ought to be possible to get the timing right.

Booplicate commented 1 year ago

The Text object you can access via _last_text should have a list of text segments in it, each of them should have a cps parameter which accounts for both the preferences and the cps text tag.

You can get those segments, calculate for how long each will be displayed, get the animation time. Then you can calculate the number of lip frames you need to show for each of the segments. E.g you have 3 segments with the durations 1, 1, and 2 seconds. You process the text and then know the first needs 5 frames, the second needs 4, the third needs 7. Which means show 5 frames in 1 second, 4 frames in 1 second, and 7 frames in 2 seconds.

*by frames I mean sprite frames, not screen frames.

howltrek commented 1 year ago

I think this is cool, but we would need to know what licenses eng_to_ipa and num2words use, and what version of Python they require.

I can't find any info on the eng_to_ipa page, it was made in 2020, so probably some kind of Python 3, but no license. num2words seems to be Python 3 only, but also uses LGPL-2.1, which is not ideal because we use a different license (but it might be okay).

If this is Python 3 only, it's fine, we will soon migrate.

even if they didn't have the license, something could be done. assign movements to the type of syllable. It sounds difficult, but it really isn't. there are conzonants that make the lips come together and others that make the tongue move, only 2 types and with the vowels we can separate them into vowels that make the mouth open a little and those that make it open a little more. that is, again only 2 variants. So an algorithm to recognize syllables would only have to recognize 4 possible variables + closed mouth and would give a very natural mouth movement. Syllables are also easy to recognize, due to the presence of vowels. monosyllables because they are between spaces or signs and consonants alone, they would give work, to program them.

hatanasinclaire commented 1 year ago

even if they didn't have the license, something could be done. assign movements to the type of syllable. It sounds difficult, but it really isn't.

eng-to-ipa and inflect both use the MIT license which is already established to be fine.

What would be the point of this? There are far more speaking mouth shapes than the system you're describing, and pronunciation and syllables are not something that can be trivially programmatically determined, at least in English. Letters in English don't correspond exactly to sounds, which is why a dedicated library like eng-to-ipa or a lookup table is necessary.

much faster to just have the other facial parts continue baking and this mouth be separate.

I believe this will be necessary not just for performance, but also to avoid interfering with the blinking transforms.

(edit: My understanding of the situation was incorrect and I have been informed that it can be done through instances of custom mouth displayables.)

Retrolovania commented 10 months ago

Is this still being worked on? This is very impressive material and it'd be great to see in the Python3 release.

dreamscached commented 10 months ago

It is.

Monika-After-Story / MonikaModDev