Now that I'm semi-interested in temporarily developing `talkinghead`... while I won't promise to support the module indefinitely, I could fix a couple of things now, and while at it, perhaps make the code easier to maintain.
I noticed that `app.py` is also broken when run standalone, although it works fine when invoked from `server.py`. PR update coming soon.
Do you want to keep the `ifacialmocap` stuff, or should I remove it for great justice? My impression is SillyTavern doesn't need it, since we're rendering an independent AI character, not an avatar for the user. In this context, TTS lip-syncing is useful, but facial motion capture isn't.
Also, the system already has emotion presets, but the manual poser is missing a GUI to load them. To speed up my own use case, I could fix this, as well as add a batch mode to produce images for all emotions automatically.
I'll also look into whether the empty `live2d` folder is needed, or if we can get rid of it.
Finally, I can also look into performance optimization, but my first impression from the code is that it's just doing things that are inherently expensive - this thing is repeatedly running inference as fast as the hardware can support. One solution here could be to add a user-configurable FPS limiter, like many games have.
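To make the idea concrete, here's a minimal sketch of such a limiter (hypothetical code, not from the plugin; `render_frame` is a placeholder for whatever produces the next frame):

```python
# Minimal FPS limiter sketch (hypothetical; `render_frame` is a placeholder).
import time

TARGET_FPS = 25  # would be user-configurable

def render_loop(render_frame):
    frame_budget = 1.0 / TARGET_FPS
    while True:
        t0 = time.monotonic()
        render_frame()  # one inference + compositing pass
        elapsed = time.monotonic() - t0
        if elapsed < frame_budget:
            # Sleep away the rest of the frame budget instead of immediately
            # starting the next inference pass.
            time.sleep(frame_budget - elapsed)
```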
I may have some time for this next week.
What do you think?
Minor update that does not affect the manual poser.
When run standalone, `app.py` still doesn't do anything (even if I uncomment the `main_frame.Show(True)`), but now it at least starts without crashing (e.g. `python -m tha3.app.app --char=example.png`).
It seems like the standalone mode of `app.py` was originally implemented for testing, likely before the plugin mode was completed. I could debug this a bit further, and see if I can get that testing GUI to actually do something.
Alternatively, I could remove the standalone mode from `app.py`, to make it more explicit it's only intended to run as a plugin.
Thoughts?
You can take ownership of this plugin, as no one else seems as interested in it. As for the live2d folder, instead of requiring it to be created manually, put and commit a .gitkeep file in it so it will be created on pull if it doesn't exist.
App mode and facial capture could be pruned if unused. As for an FPS limit, the frame rate is already not great: 10 FPS at best on a good graphics card.
Ok, I'll take this plugin over for now, at least until I'm done playing around with it. As I said, I won't promise long-term support, but I can at least improve the code in the near term.
Thanks for the `.gitkeep` trick. I'll do that, unless the folder is completely unused, in which case I'll prune the code to not look for it. EDIT: Yes, the folder was unused. Now the code no longer looks for it.
And ok, I'll prune away anything that's not needed. I think the app mode could be useful for testing/debugging, but not so much if the app code turns out to be broken beyond repair. My overall impression is that this was written as a quick hack, minimally converting an existing app into a SillyTavern plugin, not caring what broke in the process. I haven't compared to the original code by @pkhungurn.
There is also some stylistic weirdness. Module-level global variables are explicitly declared as such even when not written to. I'm tempted to remove those declarations as unnecessary, to shorten the code... but I haven't yet figured out whether those declarations are there because the original author didn't speak Python natively; or because they do, and they know that explicit is better than implicit.
I'll see what I can do about optimization. Perhaps neural networks aren't the best way to create realtime animation. The poser classes seem to be using some kind of a caching mechanism, so if that's working correctly, it shouldn't be rerunning inference for parameter combinations already seen. But I haven't yet read that part through completely.
So if my preliminary understanding is correct, then most of the CPU load could come from constantly refreshing the PNG - because the set of previously unseen parameter combinations should saturate quite early in any given SillyTavern session.
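To illustrate the kind of caching I mean - just a sketch of the general idea, not the actual poser code, and I haven't verified that its cache works like this:

```python
# Sketch of pose-keyed render caching; not the actual THA3 poser code.
_render_cache = {}

def quantize(pose, decimals=3):
    """Round the pose parameters so near-identical poses share a cache entry."""
    return tuple(round(p, decimals) for p in pose)

def render_cached(pose, render_fn):
    key = quantize(pose)
    if key not in _render_cache:
        _render_cache[key] = render_fn(pose)  # expensive: one inference pass
    return _render_cache[key]
```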
But, while I know about `result_feed` in `SillyTavern-extras/talkinghead/tha3/app/app.py`, which is a Python generator serving the current image as a PNG, I haven't yet looked at the rest of the system (`SillyTavern-extras/server.py`, and I take it that on the client side the relevant source file would be `SillyTavern/public/scripts/extensions/expressions/index.js`) for how often it requests that.
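For reference, the usual shape of such a generator-backed image feed is a multipart stream. A rough sketch, assuming Flask (the route name, variable names and the sleep interval here are made up for illustration; this is not the actual `result_feed` code):

```python
# Hypothetical sketch of a generator-backed PNG feed; NOT the actual result_feed.
from flask import Flask, Response
import time

app = Flask(__name__)
current_frame = b""  # latest encoded PNG, updated elsewhere by the render loop

def image_feed():
    while True:
        yield (b"--frame\r\n"
               b"Content-Type: image/png\r\n\r\n" + current_frame + b"\r\n")
        time.sleep(0.1)  # a fixed sleep like this alone would cap the feed at ~10 FPS

@app.route("/image_feed")  # route name made up for this sketch
def stream():
    return Response(image_feed(),
                    mimetype="multipart/x-mixed-replace; boundary=frame")
```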
10 FPS or no, this is still a cool tech demo. And it can be used for generating traditional static expression sprites automatically.
Manual poser update. Here's a quick alpha version of what's cooking.
Emotion preset loading is sort of done, and almost working. When you choose a preset emotion from the dropdown, the output image updates correctly, but the GUI sliders steadfastly refuse to update themselves to the loaded values, no matter what I do. The data is correct, but the GUI controls simply aren't taking programmatic updates. If you mouseover each slider after loading a preset, then that slider updates, but before that it just shows some nonsense value.
~Also, when switching between presets, if the arity of a parameter group changes (i.e. from something that has separate left/right sliders to something that uses just one slider), then the other slider (that is now unused) may disable itself too early, and will refuse to reset to the minimum value although the code tells it to.~ EDIT: Fixed in 397bd0e.
The wxPython docs said nothing about whether a slider needs to be enabled before you can `SetValue` it, or when such actions (`Enable` or `SetValue`) actually take place (immediately when called, or on the next iteration of the event loop - such things could be internally implemented either way).
I have absolutely no idea what's going on. Is wxPython really this buggy, or am I just using it wrong? I'm more of a backend programmer these days, and this just reminded me why. :P
~I'd like to get the GUI working right. Ideally, I'd need the help of someone better versed in wxPython.~ EDIT: Fixed in 249d152.
There are also some other unrelated TODOs. ~For example, the "[custom]" choice in the preset chooser doesn't yet work correctly. Eventually, I intend to automatically switch to that when any setting in the pose is changed manually, to indicate that the pose has been edited. I imagine I'll just need to bind an event to each of the sliders, but I already shudder to think of the cascade of events fighting each other that doing so will likely trigger.~ EDIT: Done in 2dd0c79. Went without a hitch.
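The binding idea, roughly (a sketch with placeholder names; not the actual code in 2dd0c79):

```python
# Sketch of switching the preset chooser to "[custom]" when any slider is touched.
# All names are placeholders; `state.updating_from_preset` is a guard flag that is
# set while a preset is being applied programmatically, so those updates don't count.
import wx

def bind_custom_switch(sliders, preset_chooser, state):
    def on_any_slider_changed(event):
        if not state.updating_from_preset:
            preset_chooser.SetStringSelection("[custom]")
        event.Skip()
    for slider in sliders:
        slider.Bind(wx.EVT_SLIDER, on_any_slider_changed)
```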
~I'm also thinking of implementing saving for new custom presets (without simultaneously saving an output image, like it does now), and loading for presets outside the default emotions folder.~ This could make the manual poser app a nice graphical editor for the `talkinghead` emotion poses. EDIT: 19738c7 adds loading of custom emotion JSON files. Decided against a separate JSON saving feature, and just relabeled the button, because saving actually saves both the image and the pose settings that were used to produce it. If you want to update a preset, just save into the "tha3/emotions" folder, and it'll be picked up automatically.
~Then it still needs a batch mode, to render all emotions into sprites in one go. Thankfully that doesn't need to involve the GUI.~ EDIT: Batch save added in 97d9515.
As for performance, I quickly tested running the `talkinghead` inference at `float16`, by using the `separable_half` model in the live mode (by modifying `server.py`). Intuition says this should be useful: GPUs support `float16` natively, storing weights in `float16` is often enough for inference for many AI models, it takes less VRAM, and doesn't require fetching as much data from memory.
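(For reference, the change itself is conceptually just casting the model and its inputs to half precision; a generic PyTorch sketch, not the actual THA3 loading code - `PoserNet`, the input shapes and the pose vector size are placeholders:)

```python
# Generic half-precision inference sketch (PyTorch); names and shapes are placeholders.
import torch

device = torch.device("cuda")
model = PoserNet().to(device).half().eval()  # weights stored as float16

with torch.no_grad():
    image = torch.rand(1, 4, 512, 512, device=device, dtype=torch.float16)  # RGBA input
    pose = torch.rand(1, 45, device=device, dtype=torch.float16)            # pose parameters
    frame = model(image, pose)  # activations and output also in float16
```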
But on an RTX 3070 Ti mobile, `talkinghead` sits at ~10 FPS at both `float32` and `float16`. So we can say that the inference that generates the frames is likely not the performance bottleneck - or at least that the memory traffic caused by the AI model is not the bottleneck.
Still, laptop VRAM sizes being what they are, I'll take the VRAM savings. :)
Power draw and VRAM usage, according to `nvtop`:

- `talkinghead` at `float32`: 57 W, ~800 MB VRAM for SillyTavern-extras
- `talkinghead` at `float16`: 53 W, ~520 MB VRAM for SillyTavern-extras

So by running at `float16`, we can save about 280 MB of VRAM, with no visible difference in the output.
Note that I also have another AI module, `classify`, enabled, because without that `talkinghead` would be useless.
Update on GUI investigation: after much searching, found this. Turns out that `wx.Slider.SetValue`, specifically, can be stubborn. I don't know if this only happens on the GTK backend, though.
It seems that after the `SetValue`, the slider needs to respond to some events before redrawing it is useful. Looking into `wx.CallAfter` or `wx.SafeYield` as a solution.
Got the programmatic slider updates working, for great justice. The solution was judicious use of `wx.CallAfter`. Fixed in 249d152.
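For posterity, the pattern is roughly this (a simplified sketch, not the exact code in 249d152):

```python
# Simplified wx.CallAfter sketch; not the exact code in 249d152.
import wx

def set_slider_value(slider, value):
    def update():
        slider.Enable(True)     # make sure the widget is enabled first
        slider.SetValue(value)  # then apply the programmatic value
        slider.Refresh()        # and request a redraw
    # Defer until the event loop gets control again, instead of poking the
    # widget in the middle of handling the dropdown's selection event.
    wx.CallAfter(update)
```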
> Also, when switching between presets, if the arity of a parameter group changes (i.e. from something that has separate left/right sliders to something that uses just one slider), then the other slider (that is now unused) may disable itself too early, and will refuse to reset to the minimum value although the code tells it to.
Ehm, this turned out to be a silly logic bug on my part. Fixed in 397bd0e.
As of 33f0631, the FPS counter in the manual poser works now - it measures the render time only.
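(In other words, only the render call itself is timed - something along these lines, as a sketch rather than the exact code in 33f0631:)

```python
# Sketch of measuring render time only; not the exact code in 33f0631.
import time

def timed_render(render_fn, *args):
    t0 = time.perf_counter()
    result = render_fn(*args)      # just the inference/render call
    dt = time.perf_counter() - t0  # excludes GUI updates and PNG encoding
    fps = 1.0 / dt if dt > 0 else float("inf")
    return result, fps
```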
Seems that `talkinghead` is theoretically capable of ~20 FPS on my RTX 3070 Ti mobile. This is at `float32` precision.
Here's a screenshot of the improved manual poser, running on Linux Mint:
Therefore we may conclude that indeed something else must be slowing down the live mode to 10 FPS (when that too is running at `float32`).
I'll have to get back to the precision issue. Trying this experiment at `float16`, it ran at 10 FPS! This is weird, because on a GPU, AI models usually run faster at `float16` than at `float32`. Maybe there are too many type conversions back and forth somewhere.
The manual poser is progressing nicely.
In 19738c7, added the ability to load a JSON file previously produced by this program. There is a new button in the left panel: Load emotion JSON. The pose from the JSON is applied to the current character.
Note that this feature supports only one pose per JSON file. If there are several emotions defined in the JSON file that is being loaded (like in the fallback `emotions/_defaults.json`), the loader picks the first emotion (the topmost one in the file).
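In code terms, the rule is roughly this - a sketch that assumes the JSON is either a single pose dict or a mapping from emotion names to pose dicts, not the actual loader from 19738c7:

```python
# Sketch of the "first emotion wins" rule; the file format is an assumption here.
import json

def load_first_emotion(filename):
    with open(filename, "r", encoding="utf-8") as f:
        data = json.load(f)
    first_key = next(iter(data))  # dicts preserve file order in Python 3.7+
    if isinstance(data[first_key], dict):  # multi-emotion file, e.g. _defaults.json
        return data[first_key]
    return data  # single-pose file
```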
Also, when saving, the poser now refreshes the emotion presets list, re-reading the JSON files from disk (which usually means from the OS's disk cache).
This is needed because it is possible to save the output to "tha3/emotions". So if you do that now, your output will appear in the emotion presets list.
I think this completes the "graphical emotion editor" part.
EDIT: Oh, and logging has been improved a lot.
Still TODO:
That looks nice. Let me know when it's ready to merge.
Thanks. Yes, I'll let you know.
I think we should limit the scope of this PR to the manual poser (as the title says), and then open another one about app mode cleanup and performance improvements (if those turn out to be possible - investigation still underway).
I'll likely have this one complete in the next few days.
Let me know if there's a release deadline. :)
As of 97d9515, Lambda-chan* is excited because batch save is here:
This allows creating all 28 static expressions from a new character in just a few clicks (Load image, choose the PNG file, Batch save, choose output directory, done).
Ran a batch on CPU for testing purposes. Slow, but works. Got about 2 images per second on an i7-12700H.
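Under the hood, batch save is conceptually just a loop over the emotion presets - a sketch with placeholder names, not the actual manual poser code:

```python
# Conceptual batch-save sketch; function and variable names are placeholders.
import os

def batch_save(character_image, emotion_presets, output_dir, render_fn):
    os.makedirs(output_dir, exist_ok=True)
    for name, pose in emotion_presets.items():
        frame = render_fn(character_image, pose)             # one inference pass per emotion
        frame.save(os.path.join(output_dir, f"{name}.png"))  # e.g. "curiosity.png"
```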
Now this only needs some hotkeys, and then that's about it for the manual poser. I'll get back to this tomorrow.
*I take it that's her name, based on the hairclip.
One more quick test. Stable Diffusion txt2img, manual alignment with the template in GIMP, and minimal manual editing to remove the background.
Some expressions work fine:
Others don't work so well (look at the hair):
Head rotations other than in the image plane pose a problem, at least with twintails. It seems the model is pretty particular about the alpha channel. It looks like the alpha should be clear-cut, with no feathering. Maybe I'll need to be more careful there.
Glasses are not an issue as long as the upper line of the eye is visible. If the rim of the glasses covers that line, the model will misunderstand what that line is supposed to represent.
The "eye_wink" morph works pretty well for this character, but the similar-looking "eye_relaxed" doesn't. I suppose that's a case of AI being AI. Out of training distribution?
So, while not exactly fire-and-forget, this looks promising.
Never mind me, just pasting some more testing notes here.
After manual pixel-per-pixel cleanup of edges in GIMP (~20 minutes total), and bumping the contrast of the alpha channel to +100 (to make it binary black/white), we have this input image that looks like it could have come from an early 2000s desktop mascot:

When we feed this into THA3, the resulting "curiosity" pose looks a bit cleaner. Here's the result from the batch save, generated by the manual poser:

We see that THA3 has a tendency to desaturate colors and lose some contrast. It also doesn't know what to do with the inside part of the character's right twintail. When her head is turned to the side, THA3 imagines a gray part there, no matter that the edges are sharp now.
I suppose a solid hair bunch would be preferable. This is easily done if you draw your characters manually, but not so easy to control in Stable Diffusion renders.
But how does this look when actually in use? Same pose, composed by ST onto the default cyberpunk background:
Not perfect, but serviceable. The edges of the character could still use some work. In hindsight, if the character is originally rendered onto a white background, it might be useful to use a dark background in GIMP when editing the edges, to see more clearly exactly which pixels need to be erased. Whether a light or dark background is better for the initial render depends on the colors in the character itself - obviously, high contrast between character and background is useful for separation.
Note that this cropped view is cheating a bit - the character is actually missing part of her legs, due to where the generated image ended up on the `talkinghead` template when I aligned it. This was just a quick test, so I didn't bother getting a perfect render.
P.S. For those playing at home, here is the raw SD txt2img:
This was the most "simple white background" that I got SD to make in that particular session, with the most suitable pose for the character.
Note the 512x768 size of the image. Many checkpoints for SD 1.5 that are focused on rendering humans are trained at that resolution, and will produce bad output at 512x512. In this case every 512x512 render was blurry no matter the prompt, whereas 512x768 almost always gave sharp output.
I removed the background greebles manually in GIMP. In the places where the greeble goes behind the character, the shape of the character outline is pretty simple, so a polygon lasso was fine for this. After the whole background was white, it only took a fuzzy select (wand tool) to select and remove the background for the first version in yesterday's screenshot.
I initially tried the `rembg` extension for Automatic1111 to remove the background automatically, but for this image, it didn't produce good results.
I also tried simply growing the selection made by the wand tool by one pixel (instead of manually improving the edges), but that cut too much for this particular image.
Once the character is separated from the background, you can paste the layer onto the `talkinghead` template, and use the transform tool to shrink and position it correctly. Add layer mask (from alpha channel), edit layer mask, show layer mask, and brightness and contrast may also come in useful.
Then just export the final image as PNG at 512x512 resolution.
But do note that right now, there's a minor issue with the template - the part that says "512 px" on the image isn't actually 512 px, but around 605 px, so you'll lose some quality when scaling the result down to 512x512. I could make a scaled version - or one could just use Lambda-chan (`talkinghead/tha3/images/example.png`) as the template, instead of using the annotated template image.
The checkpoint used in the test is `meina-pro-mistoon-hll3`, which you can find on the interwebs (it's capable of NSFW, so be careful - I don't know if there are any SFW checkpoints that can reliably produce this pose). The VAE is the standard `vae-ft-mse-840000-ema-pruned.ckpt`. 20 steps, DPM++ 2M Karras, CFG scale 7.
A prompt similar to the following likely produces something useful after a few attempts:
(front view, symmetry:1.2), ...character description here..., standing, arms at sides, open mouth, smiling, simple white background, single-color white background, (illustration, 2d, cg, masterpiece:1.2)
Negative:
(three quarters view, detailed background:1.2), full body shot, (blurry, sketch, 3d, photo:1.2), ...character-specific negatives here..., negative_hand-neg, verybadimagenegative_v1.3
EDIT: May need to experiment with the view prompting. A full body shot can actually be useful here, because it has the legs available so we can crop them at whatever point they need to be cropped to align the character's face with the template. The issue is that in SD 1.5, at least with anime models, full body shots often get a garbled face. One possible solution is to txt2img for a good composition only, and then img2img the result, using the ADetailer extension for Automatic1111 (0.75 denoise, with ControlNet inpaint enabled) to fix the face.
I think the `talkinghead` plugin needs a new README - I might type some kind of instructions there for others interested in creating their character's base images in SD. There are tutorials like this for creating static expressions manually, but none (that I know of) focused on making input images for `talkinghead`.
Added some hotkeys, documented right there in the GUI.
Added drop target: the source image pane now accepts PNG and JSON files drag'n'dropped from the file manager. One file at a time, please. This is the same as using the corresponding load button, but can be convenient if you have the desired folder open in a separate file manager app.
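The wxPython side of that is the standard drop-target pattern (a simplified sketch, not the exact code):

```python
# Simplified wx.FileDropTarget sketch; not the exact code in the poser.
import wx

class SingleFileDropTarget(wx.FileDropTarget):
    def __init__(self, on_file):
        super().__init__()
        self.on_file = on_file

    def OnDropFiles(self, x, y, filenames):
        if len(filenames) != 1:
            return False  # one file at a time, please
        self.on_file(filenames[0])  # the handler dispatches on extension (.png / .json)
        return True

# usage (panel and handler names are placeholders):
# source_image_panel.SetDropTarget(SingleFileDropTarget(load_file_by_extension))
```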
@Cohee1207 The manual poser part is now done.
I think we could draw the line here - any refactoring, app mode cleanup, and performance improvements (if possible) could go into a new PR.
Do you want me to squash the commits?
EDIT: Made some last-minute tidying after posting this message. Now it's done, I promise. :)
Squashing would be nice. I can do this on PR merge as well.
Argh, seems I did something wrong with the squash. I'll open a new PR that points to the correct HEAD. :)
EDIT: Opened #204.
TL;DR: Fixed the `talkinghead` manual poser app; it works now.

Tested on Linux Mint 21.1.
For context, see #199.
Changes:

- `modelsdir` parameter to `load_poser`, `create_poser` (all variants)
- `manual_poser.py` sets it to "tha3/models/", as needed by that app

To run the manual poser:

- In the "SillyTavern-extras/talkinghead" directory:
- ~`mkdir live2d`~ EDIT: No longer needed.
- `conda activate extras`
- `python -m tha3.app.manual_poser` EDIT: Can also use `./start_manual_poser.sh`.

This assumes you have the correct wxPython installed, as specified in the SillyTavern-extras README.
Why I think this is important:
The point of this tool is to allow manually posing the face (and the body slightly, too) for static expression images for a custom character, given only a single static image.
Making the various expressions with this tool is much faster than inpainting the character's face for all 28 expressions in Stable Diffusion. The resulting images can then be assigned to a SillyTavern character as its static expression images. `talkinghead` is then not needed while running SillyTavern.

If you do run the `talkinghead` module, it additionally offers a live mode, which makes the character appear more alive. However, this comes at the cost of a high CPU or GPU load.

A limitation of this tool is that the image size must be 512×512, and the character must be positioned facing directly at the camera in the appropriate pose. The various vtuber checkpoints for Stable Diffusion should be able to help with this.