hatanasinclaire opened 2 years ago
I think this is cool, but we would need to know what licenses `eng_to_ipa` and `num2words` use, and what version of Python they require. I can't find any info on the `eng_to_ipa` page, it was made in 2020, so probably some kind of Python 3, but no license. `num2words` seems to be Python 3 only, but also uses LGPL-2.1, which is not ideal because we use a different license (but it might be okay). If this is Python 3 only, it's fine, we will soon migrate.
`eng-to-ipa` says it has an MIT license on its GitHub repo, but I could not find any information on version compatibility. I was able to get the demo to run in a Miniconda Python environment running Python 3.9.12 (not sure if this is helpful), however. `num2words` can be replaced with `inflect`, which has an MIT license and requires Python 3.7 or higher, so it will work after the migration. I will work on testing `inflect` and replacing `num2words` with it.
It seems to me that the implementation of this will require a new class `MASMoniTalkTransform()` to be described in `sprite-chart.rpy`, similar to the existing class `MASMoniBlinkTransform()` and the wink transform. This will be called in `sprite-generation.rpy` within `generate_normal_sprite()`. Does this sound correct?
Aside from lib licensing, how's the performance of this? Probably should wait until r8 at minimum, but rapid sprite rendering might be too much.
This also should be optional/we would not be replacing the task of picking sprite code expressions, as the mouth is only 1 part of the expression, and we need static mouth selections for songs, piano, idle, etc...
We need to find a way to use dynamic and static mouths at the same time. It doesn't make much sense to always stay with an open mouth now that it'd be animated; at the same time, in some cases we want it to be static or keep it open.
So it's something we should brainstorm.
Talking about songs, we need a way to change the speed of her speech.
> Aside from lib licensing, hows the performance of this? Probably should wait until r8 at minimum but rapid sprite rendering might be too much.
>
> This also should be optional/we would not be replacing the task of picking sprite code expressions, as the mouth is only 1 part of the expression, and we need static mouth selections for songs, piano, idle, etc...
The mouth needs to update around 10 times a second to give the appearance of a normal talking speed. I believe this is slower than the blinking animation, which appears to have transitions of less than 0.1 seconds. However, I have not tested an in-game implementation yet.
This system does not replace the task of sprite code expressions; specifying the mouth manually is still necessary. The animation shows the mouth specified in the exp code after the mouth flaps are done. There are many instances where the mouth ought to stay open as the exp code specifies, such as when she is making a big smile or is surprised. One potential area for expansion (thanks to Booplicate for this idea) is having another set of visemes with a wider open mouth when she is speaking and angry or surprised (such as exp code mouths `b`, `o`, `w`, `x`), and another set with a less open mouth for when she is speaking in a low voice (exp code has mouth `c`, `d`, or `t`).
ok, but in terms of rendering this would have to bypass the spritecode system entirely, because facial expressions are baked together when rendered; it's much faster to just have the other facial parts continue baking and this mouth be separate. so the way this defines the additional mouth sprites does not have to follow the same pattern as the current system, unless the intention is to be usable as static sprites as well.

for control specifics, enable/disable functions are a must, but for inline options, maybe another keyword like `m 1eua static` or custom text tags. actually, having custom text tags control the speed and when it's enabled is probably the most configurable option.
Side note, since it's relevant: with r8 I'm planning to overhaul Monika's displayable/draw function. It'll be much more effective as its own displayable. That means getting rid of baking her parts together and caching them ourselves, and getting rid of the current dynamic displayable. Along with a big optimisation, it'll give us the ability to use shaders and any kind of displayables for Monika's parts (including acs, outfits, etc).
So each part of her can and probably will be some kind of sub-displayable, so using a special displayable for mouth would make sense, especially for something like this.
> so the way this defines the additional mouth sprites for this does not have to follow the same pattern
if you mean only definition, then fine, but the syntax for usage should be close to what we already have. Animated lips will be used in most places.
> for control specifics, enable/disable functions are a must
So I'd like to avoid syntax like

```
m 1eua "Hello!"
$ mas_enable_anim_lips()
m 1eub "World!"
```

it would be very inconvenient to use, even rarely.
> maybe another keyword like `m 1eua static` or custom text tags
Maybe text tags to force static, yeah. But unsure about `m 1eua static`, since `m 1eua_static` is already a thing and it makes her eyes static. We can't use one keyword to control both eyes and mouth as both can be used independently. I thought about adding `s` to the mouth code, like `b` is the animated form of the open mouth with smile, but `sb` is the static form. But iirc we don't like sprite codes of ambiguous length.
> having custom text tags control the speed and when its enabled is probably the most configurable option.
Not this one though, speed should be taken from the current persistent cps speed.
Regarding the licensing - I tested out `inflect` and it works just fine as a replacement for `num2words`, though of course it will only work after the Python 3 migration. Is the MIT License for `inflect` and `eng-to-ipa` acceptable?
> ok, but in terms of rendering this would have to bypass spritecode system entirely because facial expressions are baked together when rendered, much faster to just have the other facial parts continue baking and this mouth be separate. so the way this defines the additional mouth sprites for this does not have to follow the same pattern as the current system unless the intention is to be usable as static sprites as well.
The 19 new mouth shapes are not intended to be callable by exp codes / usable as static sprites so they can be assigned their own system.
> > maybe another keyword like `m 1eua static` or custom text tags
>
> Maybe text tags to force static, yeah. But unsure about `m 1eua static`, since `m 1eua_static` is already a thing and it makes her eyes static. We can't use one keyword to control both eyes and mouth as both can be used independently. I thought about adding `s` to the mouth code, like `b` is the animated form of the open mouth with smile, but `sb` is the static form. But iirc we don't like sprite codes of ambiguous length.
Would it make sense to have a system like:

- `m 1eua_static` makes only her eyes static
- `m 1eua_nolipsync` makes only her mouth static
- `m 1eua_static_nolipsync` makes both her eyes and mouth static?

Just by the way, I find the current mouth animation a little scary. Or maybe it's because Monika takes up most of the image in the GIF. Maybe it will look a lot better in the actual game.
> Not this one though, speed should be taken from the current persistent cps speed.
using cps text tags means this could vary speed mid sentence - probably more realistic than Monika talking at a uniform speed.
> if you mean only definition, then fine, but the syntax for usage should be close to what we already have.
my interpretation of what was previously said was that this wouldn't involve changing existing lines - so all current mouth expressions would just notify this system what set of non-sprite-code mouth sprites to use. aka no need to define new mouth sprite codes strictly under the current system.

(my point is that we don't have space for 19 more mouth codes if they aren't going to be used in the same way - and it seems hatanasinclaire is in agreement)
> Animated lips will be used in most places... it would be very inconvenient to use, even rarely.
generally a good idea to make a toggle for features that significantly change the existing behavior - not everyone is going to want or like lipflaps.
> overhaul Monika's displayable/draw function
talk to me before you do this
> on static

that's true, we do already have `*_static` sprites, so how about some other less wordy options:
- `sm 1eua` - make a new character object `sm` for "Static Monika". would mean manual show/hiding unless we did something fancy with the disps.
- `m _1eua` - can pick a letter/letters to add to the start of a sprite code - avoids the ambiguity issue.
- `m 1eulsa` - add a new sprite code for static lips, `ls` - lip static. mouth just needs to be last, middle codes just cannot conflict with the start of other middle codes (blush/sweat/tears/emote).

text tags still on the table as well.
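As a toy sketch of the `m 1eulsa` option: the two-letter middle-code strings below are hypothetical placeholders (not MAS's actual blush/sweat/tears/emote codes), but they illustrate the prefix-conflict constraint mentioned above.

```python
# Hypothetical two-letter middle codes; per the constraint above, no
# middle code may conflict with the start of another middle code.
MIDDLE_CODES = {"bl", "sw", "ts", "ls"}  # "ls" = proposed "lip static"

def has_static_lips(spritecode):
    """Scan a sprite code such as '1eulsa' for the proposed 'ls' code."""
    i = 0
    while i < len(spritecode):
        chunk = spritecode[i:i + 2]
        if chunk == "ls":
            return True
        if chunk in MIDDLE_CODES:
            i += 2  # consume a recognized two-letter middle code
        else:
            i += 1  # single-letter component (pose, eyes, mouth, ...)
    return False
```

Under these assumptions, `has_static_lips("1eulsa")` is true while `has_static_lips("1eua")` is false; a real parser would of course have to know the full sprite code grammar.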
> licensing

MIT is fine

what's the size of all libs + potential sprites (estimate is fine)?
> using cps text tags means this could vary speed mid sentence - probably more realistic than Monika talking at a uniform speed.
Sorry, I meant account for both the persistent setting and the cps text tag, because that's how renpy displays text and we should sync with it. The faster the text speed, the faster the lips move, and vice versa.
> my interpretation of what was previously said was that this wouldn't involve changing existing lines
That's what I thought too. But I think if it's required, it'd be fine too. The preferable way is if we wouldn't need to update anything, of course.
> my point is that we dont have space for 19 more mouth codes if they aren't going to be used in the same way
Misunderstood, yeah I doubt we'd need each of those as a static version. We can use existing mouth sprites for that.
> generally a good idea to make a toggle for features that significantly change the existing behavior - not everyone is going to want or like lipflaps.
That I don't get - we don't have a toggle for blinking? Lips would become just another core feature of the Monika sprite. If you want, we can add a new setting, but it sounds odd and pointless to me. But that's not what I meant; by that example I meant that using a function to switch between static/anim would be inconvenient for people who write dialogues. So we should use a text tag/sprite code.
> `sm 1eua`
At first I liked it, but how would we use only static eyes or only static mouth with this?
> `m _1eua`
This I like less syntax-wise, but it has the same issue as above.
> `m 1eulsa`
That could work, I think. Although, I liked what @hatanasinclaire suggested more.
> whats the size of all libs + potential sprites (estimate is fine)?
Each mouth .png is around 9-10 KB, about the same as the existing in-game mouth sprites. There are 19 upright and 19 leaning so together they are around 355 KB.
| component | size |
| --- | --- |
| eng-to-ipa 0.0.2 | 8.76 MB |
| inflect 6.0.0 | 291 KB |
| 19 upright mouth sprites (in-game 1280 x 850) | 184 KB |
| 19 leaning mouth sprites (in-game 1280 x 850) | 171 KB |
| Total (without code) | ~9.41 MB |
The high resolution mouth sprites at 2560 x 1700 are around 987 KB.
the idea is very good, but the neutrality of her expression is a bit creepy
> we don't have a toggle for blinking?
blinking is less busy, subtle, and it doesn't happen constantly. lipflaps flip through sprites quickly, draw attention to themselves, and would happen for every line of dialogue. these are not comparable.
> blinking is less busy, subtle, and it doesnt happen constantly. lipflaps flip through sprites quickly, draw attention to themselves
Blinking does happen constantly, sometimes multiple blinks in a row. There is some difference, but generally it's the same kind of thing. It'd be like disabling blinking, poses, or following eyes (altho that one is the closest to making sense as togglable). I don't remember a game that would allow you to toggle off facial animation.

To make it not look off, instead of just letting people disable it, we should make it look better. As I said before, we'd need a few sets of sprites: for a wide and a narrow mouth, for smile and smug. The current one is indeed just neutral. I just remembered, we do have a set of sprites! Commissioned a few months ago.

We also don't need that many frames; we can use dissolve for a transition between frames (like I did with the eyes).
If the sprites I provided are not good I am very sorry... If desired, I can remake them or produce additional expression sets, but I will not insist on them being used if a better set already exists.
After taking a closer look at the viseme shapes I do concede there is something rather unsettling about them. I remade the sprites, changing two major things:
I hope these adjustments greatly diminish the creepiness of the visual. Please let me know what you think.
It does look better, but I think it's still a little scary.
The reason could be the wide opening of the mouth, but I'm not sure.
It may also be related to how the eyes do not blink and the eyebrows do not move at all on this demonstrator. Try covering them to see how strong the effect is.
I like it. People are just not used to this yet, some found blinking weird at first too because they got used to static images.
Just as a reference, this is what we got some time ago
> Blinking does happen constantly,
it's constant, not happening constantly. She's not blinking every 100ms like the lipflap changes. The other things you mentioned (poses, eye tracking) are also slow transition changes, or subtle.
> Smile is good, but I think we should still have neutral (for serious/sad convos) and other variants
I'll make additional sets of mouth shapes for the rest of the expressions if I get the go-ahead.
> We should have control over the speed, it should be synced with the text speed
This is not trivial, but should be possible as long as there's a way to get the current "cps" parameter. For most lines, I think it's `preferences.text_cps`? For lines that manually specify `{cps=}`, perhaps there is a way to get that. Now, in English, letters don't necessarily correspond one-to-one to phonemes - "th" and "sh" are examples of two-letter digraphs that produce one sound. Visemes don't necessarily correspond directly to phonemes either: "j" and the "a" in "ate" actually require two phonemes. But with a little string handling it ought to be possible to get the timing right.
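The digraph wrinkle really is just a little string handling. A minimal sketch, assuming a toy digraph table (not the real `eng-to-ipa` data):

```python
# Toy subset of English two-letter digraphs that map to one sound.
DIGRAPHS = {"th", "sh", "ch", "ph"}

def count_sound_units(word):
    """Count rough sound units in a word, merging known digraphs."""
    word = word.lower()
    count, i = 0, 0
    while i < len(word):
        # Two letters, one sound: consume a digraph as a single unit.
        i += 2 if word[i:i + 2] in DIGRAPHS else 1
        count += 1
    return count
```

With this, "ship" counts as 3 units (sh-i-p) rather than 4 letters, which is closer to the number of visemes the word actually needs for timing.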
The `Text` object you can access via `_last_text` should have a list of text segments in it, and each of them should have a `cps` parameter which accounts for both the preferences and the `cps` text tag.
You can get those segments, calculate for how long each will be displayed, get the animation time. Then you can calculate the number of lip frames you need to show for each of the segments. E.g you have 3 segments with the durations 1, 1, and 2 seconds. You process the text and then know the first needs 5 frames, the second needs 4, the third needs 7. Which means show 5 frames in 1 second, 4 frames in 1 second, and 7 frames in 2 seconds.
*By frames I mean sprite frames, not screen frames.
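That scheduling idea can be sketched in plain Python. The segment list below is made up for illustration; in practice the durations and frame counts would come from `_last_text`'s segments and the viseme pipeline:

```python
def schedule_visemes(segments):
    """Spread each segment's viseme frames evenly over its duration.

    segments: list of (duration_seconds, viseme_frame_count) pairs.
    Returns a list of (segment_start_time, seconds_per_frame) pairs.
    """
    schedule, clock = [], 0.0
    for duration, frame_count in segments:
        schedule.append((clock, duration / frame_count))
        clock += duration
    return schedule

# Booplicate's example: 1s/5 frames, 1s/4 frames, 2s/7 frames.
plan = schedule_visemes([(1.0, 5), (1.0, 4), (2.0, 7)])
# First segment shows a new lip frame every 0.2s, second every 0.25s,
# third every 2/7 of a second.
```

Since each segment's `cps` already accounts for both the preference and the `{cps=}` tag, this naturally varies lip speed mid-sentence.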
> I think this is cool, but we would need to know what licenses `eng_to_ipa` and `num2words` use, and what version of Python they require. I can't find any info on the `eng_to_ipa` page, it was made in 2020, so probably some kind of Python 3, but no license. `num2words` seems to be Python 3 only, but also uses LGPL-2.1, which is not ideal because we use a different license (but it might be okay). If this is Python 3 only, it's fine, we will soon migrate.
Even if they didn't have the license, something could be done: assign movements to the type of syllable. It sounds difficult, but it really isn't. There are consonants that make the lips come together and others that make the tongue move (only 2 types), and the vowels can be separated into those that make the mouth open a little and those that make it open a little more (again, only 2 variants). So an algorithm to recognize syllables would only have to recognize 4 possible variables plus a closed mouth, and it would give a very natural mouth movement. Syllables are also easy to recognize, due to the presence of vowels. Monosyllables (since they sit between spaces or punctuation) and lone consonants would take some work to program.
> even if they didn't have the license, something could be done. assign movements to the type of syllable. It sounds difficult, but it really isn't.
`eng-to-ipa` and `inflect` both use the MIT license, which is already established to be fine.

What would be the point of this? There are far more speaking mouth shapes than the system you're describing, and pronunciation and syllables are not something that can be trivially programmatically determined, at least in English. Letters in English don't correspond exactly to sounds, which is why a dedicated library like `eng-to-ipa` or a lookup table is necessary.
> much faster to just have the other facial parts continue baking and this mouth be separate.
I believe this will be necessary not just for performance, but also to avoid interfering with the blinking transforms.
(edit: My understanding of the situation was incorrect and I have been informed that it can be done through instances of custom mouth displayables.)
Is this still being worked on? This is very impressive material and it'd be great to see in the Python3 release.
It is.
Overview
The following is a description of an automated system that determines a sequence of mouth shapes ("visemes") from Monika's dialogue. The idea behind this system is that the face shapes do not have to be specified manually for each line of dialogue, and therefore it does not require changing tens of thousands of lines of existing dialogue code.
This system performs a series of steps to convert text to facial shapes:

1. Numbers are converted to words with `num2words`, and some punctuation is stripped out.
2. The text is passed to `eng-to-ipa`'s `convert()` function to convert it to the IPA representation of how it is pronounced.
3. `eng-to-ipa` cannot catch the pronunciation of every word. A secondary function is used to manually specify pronunciation for some of the words that `eng-to-ipa` does not recognize. Failing this, the word is skipped if no pronunciation is available.
4. The resulting phonemes are mapped to mouth shapes. I identify nineteen distinct mouth shapes corresponding to fifty phonemes. Brand new mouth sprites for these are included.
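The steps above can be sketched end to end. Everything here is a toy stand-in: the lexicon, the fallback table, and the viseme names are hypothetical, and `words_to_phonemes` plays the role of `eng_to_ipa.convert()` plus the secondary lookup:

```python
# Toy phoneme -> viseme table; the real system maps ~50 phonemes
# to 19 mouth shapes.
PHONEME_TO_VISEME = {
    "h": "breath", "ə": "mid_open", "l": "tongue_up",
    "oʊ": "round", "w": "round", "r": "round", "d": "tongue_up",
}

# Hypothetical manual fallback for words the converter can't handle.
FALLBACK_PRONUNCIATIONS = {"monika": ["m", "oʊ", "n", "i", "k", "ə"]}

def line_to_visemes(words_to_phonemes, text):
    """Convert a dialogue line to a viseme sequence (steps 2-4 above)."""
    visemes = []
    for word in text.lower().split():
        phonemes = words_to_phonemes.get(word) or FALLBACK_PRONUNCIATIONS.get(word)
        if phonemes is None:
            continue  # step 3: skip words with no known pronunciation
        # step 4: map phonemes to mouth shapes, dropping unknown ones
        visemes.extend(PHONEME_TO_VISEME[p] for p in phonemes if p in PHONEME_TO_VISEME)
    return visemes

lexicon = {"hello": ["h", "ə", "l", "oʊ"], "world": ["w", "ə", "r", "l", "d"]}
line_to_visemes(lexicon, "Hello world")
```

The real phoneme and viseme inventories come from the repository linked below; this only shows the shape of the data flow.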
Linked below is a hastily thrown together demo that utilizes pygame to preview how this looks with an arbitrary input. Please try it out!
Next Steps
This system currently has no way to connect directly to MAS. This is where I will need some developer help.
- The first step the system needs to take for each line of dialogue is to get the dialogue actually being displayed. Booplicate points out that `mas_core._last_text` can be used to find the current line.
- The final step the system needs to perform is sending the sequence of visemes to Monika's sprite object, then displaying them on screen. This part will require a more substantial addition to the sprite system, and will likely constitute much of the remaining work to be done.

To any devs interested in incorporating this into Monika After Story, I am available for further discussion regarding this here or on Discord.
Package Requirements
- `re` - which of course is already prevalent in Monika After Story.
- `eng_to_ipa`
- `num2words`
Limitations
- `mas_core._last_text`: depending on how it works, variables may or may not be stripped out of the text. But I am not too familiar with exactly what it outputs.

Code
The code, along with the sprites and a demo program can be found at this repository.
External Links
- eng-to-ipa
- num2words

The phoneme-to-viseme conversion is informed by existing viseme systems but ultimately is unique to this prototype:

- Face the FACS
- Oculus Viseme Reference
- Microsoft Azure Cognitive Services