talkinghead changes TODO list

SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.

GNU Affero General Public License v3.0


talkinghead changes TODO list #206

Open Technologicat opened 6 months ago

Technologicat commented 6 months ago

EDIT: This has become more of a temporary devblog and less of a TODO list.

:exclamation: Development can move fast. Old posts are old. :exclamation: See the latest posts below for what is currently going on.

Talkinghead TODOs, as of the latest merged PR at any given moment, can be found in talkinghead/TODO.md in the main SillyTavern-extras repo.

EDIT: :exclamation: The rest of this post is old, preserved for archival purposes only. :exclamation:

This is primarily for myself to keep track of what I'm doing, as well as to record any leftover ideas that are not so likely to get done.

Strip away unnecessary stuff from app.py. EDIT: Done, the plugin is now a bit over 400 lines.
- ~It seems the Mac-specific IFacialMocap stuff has mostly been stripped already, but the app suffers from the pieces that remain.~ EDIT: Removed in https://github.com/Technologicat/SillyTavern-Extras/commit/4a25a1e64addf3062d36163ce9b60ddc1561f07f.
- We could use the THA pose model only, with no need for conversion between two incompatible sets of pose keys. EDIT: Yup, that's how it works now.
- Also lip sync code (which the old manual hints at) seems to have been stripped - the talking animation is random.
- Syncing to TTS would be nice in the long term, but not so relevant for me. I don't have the VRAM to run yet another deep learning model simultaneously, and text output is fine for my use cases anyway.
- EDIT: But the start/stop talking animation API endpoints are not getting called at all! Need to add calls to the client side.
- The standalone app mode for the live mode has already been stripped to the point where the app does nothing.
- ~Thus, I'll probably make app.py only serve the SillyTavern-extras plugin, and remove the standalone app mode from it. There's no way to send events to the standalone mode anyway, and the separate manual poser app already covers the other use case.~ EDIT: As of https://github.com/Technologicat/SillyTavern-Extras/commit/e02c1d98f273e71b3c7f07b79ac774d64a8e376b, app.py only serves the live mode.
- When running as a SillyTavern-extras plugin, the live mode probably doesn't even need wxPython.
  - ~Currently, it has some calls into wxPython, but I'll see if I can strip it.~ EDIT: Yup, stripped.
  - ~When talkinghead is enabled, SillyTavern-extras segfaults on exit every time. And doesn't when the talkinghead module is not loaded. I wonder if the unnecessary wx.App never being cleaned up properly might have something to do with that.~ EDIT: Yes, exactly. No more crashes.
  - The manual poser does need wxPython, as it's a local GUI app implemented in... wxPython. (This in itself is just fine.)
- ~Emotions are loaded from disk every frame, twice. Surely this is unnecessary. Even if the read hits the OS disk cache, we could just explicitly keep the emotion presets in memory, like the manual poser now does.~ EDIT: This has been fixed.
- ~The animation logic is a mess. There are probably a few steps that can be dropped. However, this will likely improve the code clarity much more than performance.~ EDIT: Fixed.
- ~There are two sway animation functions, one "good" and the other... implicitly, bad? Investigate which is actually better, and keep only one.~ EDIT: They were pretty much identical; kept the one that was already in use.
- By the way, why doesn't the body sway (though we have a morph for that), only the head does? Maybe this animation is missing something. EDIT: Yes, it's only animating head_y. Improving this would make the animation look nicer.
- Also, experiment with a cosine schedule for the random head rotation and sway animations. Currently the direction of motion might switch instantly, which looks awkward. EDIT: It's actually a rate-based formulation (think integrating an ODE). This is probably better because this can be (and is) defined in terms of the current state and desired end state. Even if the end state suddenly changes, a rate formulation will adapt nicely.
~Add an option to server.py to choose float32 or float16 for talkinghead. Regardless of speed, useful with low VRAM; can save ~280 MB by switching to float16.~ EDIT: Done.
~Refactor stuff shared between the manual poser and live mode into a common app-level utility module.~ EDIT: Done.
- ~For example, the updated emotion preset loading code currently lives in the manual poser app.~ EDIT: Moved to a new talkinghead/tha3/app/util.py, along with other code shared between the apps.
Investigate possibilities for improving the speed.
- EDIT: axing the wxPython dependency in app.py did it! Now the live mode runs at ≈30 FPS on an RTX 3070 Ti mobile. https://github.com/Technologicat/SillyTavern-Extras/commit/e02c1d98f273e71b3c7f07b79ac774d64a8e376b
- ~The live mode currently runs at ≈10 FPS.~
- Though this is an AI plugin, relatively low GPU usage (according to nvtop) suggests that inference isn't the largest bottleneck. (It may be the second largest, though.)
- The manual poser can run inference at ~20 FPS on an RTX 3070 Ti mobile, if all we do with the result is convert it into a wx image and render it in a local GUI app. (I hope my measurement code is correct. It should be.)
- Repeatedly sending PNGs surely isn't the fastest way to stream video.
- EDIT: Investigate later about streaming as YUVA420 using ffmpeg.
- ~(More observations may follow later as investigation proceeds.)~ EDIT: There were no other factors slowing it down.
EDIT: Improve live mode simulation timestep handling.
- Animation speed should be measured against wall time (instead of rendering as fast as the system can), since new GPUs are likely to become faster in the future. EDIT: Not so necessary now that we can control the target frame rate.
- ~For future-proofing and saving GPU compute, add a configurable framerate limiter. Could be given as a command-line option to server.py.~ EDIT: Framerate limiter added, but currently fixed to ~24 FPS.
Write new README/manual for talkinghead.
- Both the use case and the supported features differ from the original THA3 package.
- The manual poser app now has a few features the original didn't, and its usability has been improved.
- Emotion presets can now be accessed in the manual poser GUI.
  - The presets are just JSON files in talkinghead/emotions/.
- Load a custom emotion JSON, applying it to the editor.
  - Saving an image has always saved the JSON, too - this allows loading that JSON back in.
- This combination of features makes the manual poser into a graphical emotion editor for talkinghead.
- Batch save image and JSON from all emotion presets.
  - Useful for generating the 28 static expression sprites automatically.
- Drag'n'drop PNG or JSON from a file manager into the source image pane, to load that file.
- Some hotkeys.
- Can now optionally run on CPU, if a framerate of 2 FPS doesn't matter. (Fine enough for batch export.)
- Could include hints as to how to create the base image for a character in Stable Diffusion and GIMP. See comments in #203. (That PR itself had to be moved to #204 when I messed up my commit squashing workflow.)
- What I don't understand at the moment is the varying quality of the output. The original examples look pretty impressive (obviously these things are almost always cherrypicked, but still). Trying to do the same with a custom input image doesn't reach the same quality, even though the viewpoint is correct and I'm aligning the character to the template correctly.
- Is the output of my SD checkpoint too far off from the kind of illustrations that were used in the training data of THA3? If so, which checkpoint would be better?
- In the description of the first THA (the original postings were fortunately preserved by the Wayback Machine!), the author mentions that the training data consisted of ~8000 MikuMikuDance characters, placed in various random poses that THA was intended to approximate, for a total of slightly over 1 million images. For THA2, the set of models was the same, but the image count was bumped up to 3.5 million. THA3 used the same models, but only 500k images. The author notes that while the training data (rendered with his custom renderer with standard Phong shading) looks more like a 3D model than a drawing, it's similar enough that the system works just as well on drawings. Indeed, the examples demonstrate that it does.
- "More specifically, pixels that do not belong to the character must have the RGBA value of (0,0,0,0), and those that do must have non-zero alpha values." So yes, sharp edges in the alpha channel are preferable for use with THA.
- There's also a picture of what the various morphs should do; most of them are self-explanatory, but "mouth delta" puzzled me. It's named that way because this morph makes the shape of the mouth into an uppercase delta, Δ.
- The most recent version, THA4, doesn't sound so useful for SillyTavern, as it's slower than THA3 (150 ms per frame on an RTX Titan). There's distillation, but that requires training the student network for each specific character separately. Also, I don't know whether the models or code are available; at least there doesn't seem to be a GitHub repository for them.
On a 4k display, the character becomes rather small, which looks jarring on the default backgrounds. This needs a fast, but high-quality scaling mechanism.
- The algorithm should be cartoon-aware, some modern-day equivalent of waifu2x. A GAN such as 4x-AnimeSharp or Remacri would be nice, but too slow.
- Maybe the scaler should run at the client side to avoid the need to stream 1024x1024 PNGs. What JavaScript anime scalers are there?

Notes:

The models themselves suffer from chronic object orientation poisoning (spoken as an impure-functional programmer). The code would read better with fewer abstractions, and less ravioli. But the models run fine, so no need to touch them. The API cross-section (at least as actually needed here) is pretty minimal, which is good.

Technologicat commented 6 months ago

Yup, wxPython is not needed by app.py. Got rid of it, and:

INFO:talkinghead.tha3.app.app:FPS: 30.3

This got rid of the crash on exit, too. It was likely because wx.App was running in a background thread, which is a no-no.

There's still a lot to fix. It's still eating a lot of resources, but at least it's doing something useful with them.

We'll need better idle animations, too, now that the framerate is improved.

PR upcoming eventually. In the meantime, preview here: https://github.com/Technologicat/SillyTavern-Extras/tree/appfixes

Technologicat commented 6 months ago

@Cohee1207: One specific question: you mentioned that you find talkinghead uncanny. Was it because of the low framerate, the quality of the idle animations, or the image quality of the AI interpolation? Trying to evaluate if I could do something.

The framerate is now at least somewhat fixed, and I might have some ideas to improve the idle animations, but the quality of the AI interpolation is what it is as long as we're running on the THA3 models. Hardware is also what it is, so I think this is the right model size right now.

But judging by the pictures in the tech reports, THA3 should be capable of rather impressive quality, given the right input. With regard to this part, how to produce a suitable input easily with SD/GIMP is the open question.

Cohee1207 commented 6 months ago

I don't find the motion produced to look pleasant or enjoyable to the degree of just a good static image. Maybe I'm biased or influenced by the publicity of some well-known "AI animations" that have a similar vibe to them (for reference).

Technologicat commented 6 months ago

Thanks for your input!

Personally, I don't see anything wrong with the animation in the link you provided, except that in animation, my own preference is 2D. Used to be an anime fan back in the day. 3D CG has always looked wrong to me. I suppose it's a matter of taste.

The specific question was because at the higher framerate, the current idle animations of talkinghead look off to me. At least now we have some speed to do something interesting with. I'll look into it.

Technologicat commented 6 months ago

As of 4a25a1e, the remnants of the IFacialMocap stuff are gone from the code. Animation logic rewritten, for great justice.

Next to clean up the repo.

And to figure out what I borked in my local git, it's telling me that origin/appfixes is not a branch (and fails to update the corresponding head), although the push to GitHub works fine.

Technologicat commented 6 months ago

Repo cleaned up. The plugin is now ~400 lines, and the code that remains looks much cleaner. :)

Technologicat commented 6 months ago

PR posted, see #207.

EDIT:

Planned next:

Future-proof the renderer.
- Decouple simulation timestep from render speed (we're not an early 1980s video game!).
- Measure against wall time (we already measure FPS anyway) and calibrate automatically.
- Add a configurable FPS limiter, for saving GPU compute especially on future, faster GPUs.
- EDIT: FPS limiter added, but it's not configurable yet. For now, hardcoded to send ~24 FPS to the client.
~Improve idle animations.~ EDIT: New sway and blink animations done, breathing added.
- The pose interpolation already uses what we computational scientists call a rate-based formulation.
- That is, given the state variables and the time coordinate, the formulation gives us the instantaneous rate of change of the state variables - much like in the standard first-order ODE system u' = f(u, t).
- This kind of formulation only needs the current and target states; history doesn't matter (or, like in some computational plasticity models in the engineering sciences, can be represented by adding a new history-free state variable that in effect tracks what the history has done to the thing being modeled; in programming terms, an internal private state, like objects have).
- The current implementation uses the elementwise diff [target_pose - current_pose], multiplied by a step size, as the rate of change.
- We can leverage this observation to simplify the sway code: just modify the target pose appropriately, and let the ODE integrator perform the actual interpolation.
- For a constant target state, the approach implies that the state evolves akin to (1 - exp(-λt)) for some constant λ - which visual observation confirms it indeed does.
- If we want a smooth start, we can e.g. save the timestamp when the animation began, and then ramp the rate of change, beginning at zero and (some time later, as measured from the timestamp) ending at the original, non-ramped value. The ODE itself takes care of slowing down when we approach the target state.
- This also implies that integrating by explicit Euler, like we do now, isn't the best possible idea (for arbitrary problems f; currently, there is some structure that mitigates this).
- To be safe, we should probably switch to Heun, like Kerbal Space Program eventually did, or perhaps to some other explicit Runge-Kutta. To anyone not in ODE/PDE numerics, it can't be overstated how much using anything but explicit Euler helps with a simulation not exploding at large timestep sizes.
- EDIT: Actually, although the rate-based formulation of the pose update is essentially Newton's law of cooling, we are effectively reading off points from the analytical solution, not numerically integrating. So we don't actually need to do anything special - the current pose integrator is already stable.
- ~Improve blinking~ (EDIT: Done!), both the frequency (it's not a Markov process...) and the actual animation. Dolphin-Mistral told me that for humans, the frequency should be 12-20 times a minute :P
- [This is the first time a piece of software has given me suggestions on how to improve itself. I should work more with LLMs.]
- ~Add all supported axes to sway animation as far as they make sense. Experiment with this.~ (EDIT: Done!)
- Every x seconds of simulated time, randomize a new target deviation on top of the pose of the target emotion. Clamp to a reasonable range (original head sway code uses -0.6 ... 0.6; zero is centered, and 1.0 = 15 degrees as documented in the THA tech reports). Let the ODE integrator interpolate.
- ~Some emotions could optionally change the idle animations when the new emotion is entered, e.g. to make a confused character quickly blink twice.~ EDIT: Done.
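
To make the rate-based pose update above concrete, here is a minimal sketch. It assumes the pose is a plain list of floats and that the step size is a constant fraction per frame; the function name `interpolate_pose` and the numbers are illustrative, not taken from app.py.

```python
import math


def interpolate_pose(current, target, step=0.1):
    """One update of the rate-based formulation.

    The rate of change is the elementwise diff (target - current) times a
    step size, so each element moves a fixed fraction toward its target.
    """
    return [c + step * (t - c) for c, t in zip(current, target)]


# Hypothetical two-element pose, starting at rest, with a constant target.
pose = [0.0, 0.0]
target = [1.0, -0.5]
step = 0.1
n = 20
for _ in range(n):
    pose = interpolate_pose(pose, target, step)

# For a constant target, the remaining distance shrinks by (1 - step) per
# frame, i.e. the state follows target * (1 - exp(-lam * n)) exactly,
# with lam = -ln(1 - step).
lam = -math.log(1.0 - step)
expected = [t * (1.0 - math.exp(-lam * n)) for t in target]
print(pose, expected)
```

Since the update only looks at the current and target states, changing `target` mid-animation needs no special handling: the next call simply starts moving toward the new target.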

TODO:

Investigate if some particular emotions could use a small oscillation applied to iris_small, for that anime "intense emotion" effect (since THA3 doesn't have a morph specifically for the specular reflections in the eyes).
Investigate if we can get the (random) talking animation working when the LLM is streaming text.
Make the idle animation parameters configurable, to give personality to different characters.
Other TODOs marked in the plugin source code, talkinghead/tha3/app/app.py.

Technologicat commented 6 months ago

Implemented a framerate limiter in talkinghead-next@Technologicat.

This branch also includes a command-line flag in server.py to choose the talkinghead compute model.

Result:

INFO:talkinghead.tha3.app.app:available render FPS: 46.0
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 23.9
INFO:talkinghead.tha3.app.app:available render FPS: 46.7
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 23.9
INFO:talkinghead.tha3.app.app:available render FPS: 46.6
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.0
INFO:talkinghead.tha3.app.app:available render FPS: 46.8
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.0
INFO:talkinghead.tha3.app.app:available render FPS: 46.1
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 23.8
INFO:talkinghead.tha3.app.app:available render FPS: 45.5
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 23.4

The available render FPS measures how fast the animator can run on the current hardware. This is using the separable_half model (i.e. float16). Same machine as before, with an RTX 3070 Ti mobile.

The rate-limited network FPS measures the actual time between network sends, after applying the framerate limiter.

The code limits the network FPS to a hard-coded 25 (0.04 seconds wait per frame). Due to the simplistic way that I currently calculate the wait time, this sets an upper bound that's never reached exactly.

I tried more sophisticated ways to calculate the wait time, but they turned out brittle, and didn't improve the result. So I think the simplistic version is the best - ~24 FPS is just fine.

We now render only as many frames as the client consumes, so as long as the render FPS > network FPS, this will save GPU compute resources compared to the previous versions.
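
A minimal sketch of this kind of simplistic limiter, under the assumption that it sleeps for whatever remains of the 0.04 s frame budget after the send. The function `run_rate_limited` and its parameters are hypothetical; the actual wait-time calculation in the plugin may differ.

```python
import time


def run_rate_limited(send_frame, n_frames, target_fps=25.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call send_frame() n_frames times, at no more than target_fps.

    The clock and sleep functions are injectable so that the timing logic
    can be exercised without actually waiting.
    """
    target_dt = 1.0 / target_fps  # 0.04 s per frame at 25 FPS
    for _ in range(n_frames):
        t0 = clock()
        send_frame()
        elapsed = clock() - t0
        if elapsed < target_dt:
            # Wait out the rest of the frame budget. Any overhead outside
            # the measured window (loop bookkeeping, sleep overshoot) is
            # not compensated, so the real rate lands slightly below target.
            sleep(target_dt - elapsed)
```

The uncompensated overhead is one plausible reason why such a limiter treats the target as an upper bound rather than hitting it exactly.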

Next up:

Now that we always run near 24 FPS given enough GPU compute, I'll leave the timestep implementation as-is, unless there is interest in supporting lower-spec hardware that can't reach that 24 FPS.

So the next step is improving the idle animations.

EDIT: But there's also CPU mode, which runs at ~2 FPS on an i7-12700H, but may be useful for testing. Fixed the framerate limiter to work correctly also when render FPS < network FPS (in that case, the latest rendered frame is re-sent until a new one becomes available). But the animation logic needs to account for this.

Technologicat commented 6 months ago

In talkinghead-next, added:

A simple breathing animation driver
Very initial experiment at multi-axis sway (needs tuning - the character looks like she's had too much to drink)

I already have a plan for how to improve the sway animation. Stay tuned...

Technologicat commented 6 months ago

Um... turns out that while testing, I had accidentally underclocked my GPU to 1100 MHz (from 1700 MHz).

I mean, I do that on purpose to reduce fan noise, but the underclock wasn't supposed to be active during the performance test. I meant to run at factory settings, to give a better idea of performance on a stock RTX 3070 Ti mobile GPU chip (for reference, 125W TDP; there are various laptop brands/models with the same GPU, but different GPU TDP).

So, rerunning the test at full clock rate. Result:

INFO:talkinghead.tha3.app.app:available render FPS: 62.5
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.2
INFO:talkinghead.tha3.app.app:available render FPS: 63.1
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.2
INFO:talkinghead.tha3.app.app:available render FPS: 63.4
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.1
INFO:talkinghead.tha3.app.app:available render FPS: 60.0
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.1

Compare with the GPU underclocked:

INFO:talkinghead.tha3.app.app:available render FPS: 47.1
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.1
INFO:talkinghead.tha3.app.app:available render FPS: 45.2
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 23.8
INFO:talkinghead.tha3.app.app:available render FPS: 47.2
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.2
INFO:talkinghead.tha3.app.app:available render FPS: 46.7
INFO:talkinghead.tha3.app.app:rate-limited network FPS: 24.3

So it turns out this thing can render at 60 FPS with the separable_half model. GPU power draw is then near 80 W, though.

I can only speculate how fast it would run (and how much power it would draw) on a desktop GPU.

Technologicat commented 6 months ago

In talkinghead-next, added:

New history-free sway engine (using a rate-based formulation):
- Every few seconds, randomize a new deviation from the target pose for all sway axes (head, neck, body).
- Assign this as the target pose, and let the pose interpolator perform the actual animation.
- Micro-sway: add small dynamic noise (re-generated every frame) on top of sway target pose.
- This makes the motion look more natural, especially once we are near the target pose.
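
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in talkinghead-next: the class `SwayEngine`, the axis name `body_y`, the 5-second interval, the micro-sway amplitude, and the final clamp to [-1, 1] are all hypothetical. Only `head_y` and the ±0.6 range come from the notes earlier in this thread.

```python
import random


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


class SwayEngine:
    """History-free sway: only produces a target pose for the interpolator."""

    def __init__(self, axes, interval=5.0, macro=0.6, micro=0.02, rng=None):
        self.axes = list(axes)
        self.interval = interval   # seconds between new sway targets
        self.macro = macro         # max deviation from the emotion pose
        self.micro = micro         # per-frame noise amplitude
        self.rng = rng or random.Random()
        self.deviation = {axis: 0.0 for axis in self.axes}
        self.next_change = 0.0

    def target(self, emotion_pose, t):
        """Return the sway target pose at simulated time t (seconds)."""
        if t >= self.next_change:
            # Every few seconds, pick a new deviation for all sway axes.
            self.deviation = {a: self.rng.uniform(-self.macro, self.macro)
                              for a in self.axes}
            self.next_change = t + self.interval
        out = dict(emotion_pose)
        for a in self.axes:
            # Micro-sway: small noise, re-generated on every frame.
            noise = self.rng.uniform(-self.micro, self.micro)
            out[a] = clamp(emotion_pose.get(a, 0.0) + self.deviation[a] + noise,
                           -1.0, 1.0)
        return out
```

The returned dictionary would then be handed to the pose interpolator as its target; the engine itself never touches the current pose, which is what makes it history-free.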

Technologicat commented 6 months ago

talkinghead-next has been PR'd, see #209.

Technologicat commented 6 months ago

One more TODO that has not been mentioned here yet:

The eye_unimpressed morph has just one key in the emotion JSON, although the model has two morphs (left and right) for this.
- We should fix this, but it will break backward compatibility for old emotion JSON files.
- OTOH, maybe not much of an issue, because in all versions prior to this one being developed, the emotion JSON system was underutilized anyway (only a bunch of pre-made presets, only used by the live plugin). All the more important to fix this now, before the next release, because the improved manual poser makes it easy to generate new emotion JSON files, so from now on we can assume those to exist in the wild.

Also, to investigate:

Can you /emote a live talkinghead? Would make testing much easier... EDIT: Apparently not. But I suppose this would be easy to add. Needs a new API endpoint in server.py, and then making the client handle the /emote xxx by calling it.

Technologicat commented 6 months ago

It's still experimental, but here, a small xmas present to the open source community.

talkinghead-nextnext now renders your live AI avatar as a lo-fi scifi hologram:

Yes, she's translucent:

The scanlines and noise are dynamic, and the bloom (fake HDR) imitates the look of early 2000s anime.

GPU powered, as usual. This consists essentially of a few small fragment shaders written in Torch. :P

Man, I love open source.

TODO: clean up the code, make this configurable, and see if we can improve performance (at full GPU power, 48 FPS render, 18 FPS network send; underclocked, 39 FPS render, 18 FPS network send).

Cohee1207 commented 6 months ago

Thanks, that looks cool. But I'm not sure many people frequently monitor these issues. To get a wider audience for this, do a post on resources like Reddit.

Katehuuh commented 6 months ago

It would be great to have an all-in-one talkinghead+live2d, or at least to call some features from live2d to make it more compatible.

Technologicat commented 6 months ago

@Cohee1207 : Good point. OTOH, there's still much coding to do before this is ready for prime time. For current thoughts, see the new TODO.

Also, I feel I'm not the right person to spend much time engaging with an audience. I could post my thoughts on a devblog or something, but regularly responding to reader comments is probably too much.

@Katehuuh : Compatibility is a nice long-term goal, but it's not in the immediate future.

Frankly, the first I heard of Live2D specifically was when I happened to run into talkinghead. I was previously aware that VTubing is a thing, but that was about it. Then along comes this piece of tech that can animate an anime character on a GPU in realtime, based on an AI model...

...so I don't really have a handle on what features anyone but me expects. :)

@ everyone reading this:

Personally, I have an artistic and technical vision as to where I want to take this. I'm doing this for two reasons: 1) Cover my own use case of making an AI assistant character feel less "cold" to interact with, and 2) Give back to the SillyTavern community by contributing potentially useful changes.

I think ST is, at this time, the most comprehensive platform for my use case. It has a vector store that can ingest PDFs (for research use), its hardware requirements are tolerable for a laptop user, and it has a unique focus on making the AI into a character to interact with (with all the features that work toward that goal).

As for the THA3 posing engine, the fact that it works in anime style, specifically, is a major bonus for me.

New dev branch, talkinghead-next2@Technologicat. Rebased talkinghead-nextnext on the latest upstream main; new development will happen in this branch (talkinghead-next2). I will eventually delete any outdated branches.

Changelog after PR #209:

Add experimental visual postproc chain (lo-fi scifi hologram)
- TODO: This is not configurable yet; it's always on until I add a config file system.
Improve emotion preset loading logic
- Even if an emotion preset JSON is missing, load the emotion from _defaults.json.
Add blunder recovery (emotion preset factory reset) options to manual poser.
Fix factory-default preset name angry → anger
Manual poser: return nonzero exit code on init error.
The manual poser now also auto-installs THA3 models if needed.
- Same location; only one copy is installed, shared between live mode and manual poser.
Move TODO list into its own file, dump everything there.
Add a new README for the revised talkinghead.
- TODO: Add pictures, it doesn't have any yet.
- TODO: Incorporate stuff from the old talkinghead user manual.
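
As an illustration of the preset fallback mentioned in the changelog, here is a minimal sketch. It assumes that each per-emotion file holds the pose dictionary directly and that `_defaults.json` maps emotion names to pose dictionaries; the real file layout may differ, and `load_emotion` is a hypothetical name.

```python
import json
from pathlib import Path


def load_emotion(emotions_dir, name):
    """Load an emotion preset, falling back to the factory defaults.

    Assumed layout (not confirmed by the source):
      <emotions_dir>/<name>.json    -> pose dict for that one emotion
      <emotions_dir>/_defaults.json -> {emotion name: pose dict, ...}
    """
    emotions_dir = Path(emotions_dir)
    preset = emotions_dir / f"{name}.json"
    if preset.exists():
        with open(preset, encoding="utf-8") as f:
            return json.load(f)
    # The preset JSON is missing: use the entry from the defaults instead.
    with open(emotions_dir / "_defaults.json", encoding="utf-8") as f:
        defaults = json.load(f)
    return defaults[name]
```

With a layout like this, a factory reset of a single emotion amounts to deleting its per-emotion file.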

Technologicat commented 6 months ago

As of https://github.com/Technologicat/SillyTavern-Extras/commit/162d27ede6732b16fd0e24a876621aa6d0a74e32, I think that's enough overdoing the postprocessing filters for now. Next up, refactoring the postprocessor that currently takes fully one half of app.py, and making it configurable.

EDIT: And as of https://github.com/Technologicat/SillyTavern-Extras/commit/3e0ac731949dc7762f4c8ec7bf82b6ffbeacc179, the postprocessor now lives in talkinghead/tha3/app/postprocessor.py, and has the ability to take a configuration. Format documented there, designed for easy JSONability. Now we just need a client end to manage and feed in such configurations.

@Cohee1207: I'll soon need to expose some configuration options for talkinghead, including things such as idle animation parameters (how fidgety the character is, their breathing rate, etc.) and postprocessor settings (e.g. some characters could be scifi holograms, some could look like a badly calibrated 1980s VHS tape, while most would be normal).

This needs a very small amount of string/bool/int/float options that could be stored as JSON. In the long term, per-character settings storage would be preferable. Also, I really want the settings to be modifiable live, to allow interactive experimentation with the live character's look and feel.

So a question: What is the preferred way to do this?

For example, should I also modify the main SillyTavern code, adding a new configuration panel next to Character Expressions in the client, make the server save the settings under public/characters/charactername/talkinghead.json, and make the character expressions system send those settings via a new API endpoint in SillyTavern-extras?

I can handle the extras side easily, but I haven't yet looked at the main ST code. For JavaScript, I'll need some code examples to get going, but I suppose I can get those by looking at the code of the existing config panels and at how the system currently interacts with talkinghead.

EDIT: This is essentially what I know about JS (I wrote that in early 2020, when working on a full-stack project with a Python backend).

Technologicat commented 6 months ago

Update: As of https://github.com/Technologicat/SillyTavern-Extras/commit/7876ecbe99ec65453eac9932a0f4314804ff344a, frame timing is good now.

Also, PNG is fine as transport if we drop to the fastest compression_level=1 instead of the default, tighter 6. Even at 1, the network send itself still takes under 0.15ms per frame. On my i7-12700H, a PNG encode completes in 20ms at 1 instead of 40ms at 6 (and, out of curiosity, 120ms at the maximum setting 9).
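
To get a feel for the level-versus-time tradeoff on another machine, here is a rough sketch that times the deflate step alone using the standard library's `zlib`, since PNG image data is stored as a zlib stream. This is a stand-in, not the plugin's encoder: the 512x512 RGBA frame size and the synthetic test pattern are assumptions, and real PNG encoding adds row filtering on top, so absolute numbers will not match those quoted above.

```python
import time
import zlib


def deflate_stats(raw, levels=(1, 6, 9)):
    """Compress raw bytes at each level; return {level: (seconds, size)}."""
    stats = {}
    for level in levels:
        t0 = time.perf_counter()
        compressed = zlib.compress(raw, level)
        stats[level] = (time.perf_counter() - t0, len(compressed))
    return stats


# Synthetic stand-in for one RGBA frame (assumed 512x512, 4 bytes per pixel).
width = height = 512
raw = bytes((x ^ y) & 0xFF
            for y in range(height) for x in range(width) for _ in range(4))

stats = deflate_stats(raw)
for level, (seconds, size) in stats.items():
    print(f"level {level}: {seconds * 1000:.1f} ms, {size} bytes")
```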

The system now uses three threads. Regardless of the global interpreter lock, in my tests this improves throughput. In general, while frame N is being sent, frame N+1 is being encoded, and frame N+2 is being rendered.

At most as many frames are rendered as are actually sent. Each new frame is encoded only once. The network output is isolated from any hiccups in render and/or encode. If a new frame is not available, it re-sends the latest available one.

Example on the RTX 3070 Ti mobile, underclocked to 1100 MHz to reduce fan noise. This is with some postproc filters enabled (specifically: bloom, chromatic aberration, vignetting, translucency, alpha noise, banding, scanlines):

INFO:talkinghead.tha3.app.app:render: 24.7ms [40.4 FPS available]
INFO:talkinghead.tha3.app.app:output: 40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:encode: 22.9ms [43.7 FPS available]; send sync wait 6.9ms
INFO:talkinghead.tha3.app.app:render: 24.8ms [40.3 FPS available]
INFO:talkinghead.tha3.app.app:output: 40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:encode: 23.2ms [43.1 FPS available]; send sync wait 6.6ms
INFO:talkinghead.tha3.app.app:render: 24.7ms [40.5 FPS available]
INFO:talkinghead.tha3.app.app:output: 40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:encode: 23.2ms [43.1 FPS available]; send sync wait 5.7ms

In this example, although a render+encode combo would take ~48ms if run serially, it actually completes in ~34ms, as is seen from the ~6ms spent in "send sync wait". This means that the encoder has an encoded frame ready, but is waiting for the previous encoded frame to be consumed (sent over the network) before updating its output. At that time, the render for the next frame is already in progress; it starts in parallel as soon as the encoder starts encoding the current one.
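
The overlap described here can be sketched with three threads connected by single-slot queues. This is an illustration of the general technique, not the code in app.py: `run_pipeline` and its callbacks are hypothetical, the frame count is fixed, and the re-send of the latest frame when no new one is available is left out.

```python
import queue
import threading


def run_pipeline(n_frames, render, encode):
    """Run render -> encode -> send in three threads; return the sent frames.

    Each queue holds at most one item, so a stage blocks until the next
    stage has consumed its previous output. The blocking put() in the
    encoder plays the role of the "send sync wait" in the logs above.
    """
    rendered = queue.Queue(maxsize=1)
    encoded = queue.Queue(maxsize=1)
    sent = []

    def render_worker():
        for i in range(n_frames):
            rendered.put(render(i))
        rendered.put(None)  # sentinel: no more frames

    def encode_worker():
        while (frame := rendered.get()) is not None:
            encoded.put(encode(frame))
        encoded.put(None)

    def send_worker():
        while (data := encoded.get()) is not None:
            sent.append(data)  # stand-in for the network send

    workers = [threading.Thread(target=w)
               for w in (render_worker, encode_worker, send_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sent


print(run_pipeline(5, render=lambda i: i, encode=lambda frame: f"png{frame}"))
```

The single-slot queues are what keep the renderer from running ahead: it can never be more than a couple of frames in front of the sender, so no more frames are rendered than are consumed.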

The three-part division of responsibilities also makes it obvious which part is the slow one in CPU mode:

INFO:talkinghead.tha3.app.app:encode: 32.0ms [31.2 FPS available]; send sync wait 0.0ms
INFO:talkinghead.tha3.app.app:output: 39.9ms [25.0 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:render: 611.4ms [1.6 FPS available]
INFO:talkinghead.tha3.app.app:encode: 32.1ms [31.2 FPS available]; send sync wait 0.0ms
INFO:talkinghead.tha3.app.app:output: 39.9ms [25.1 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:render: 607.3ms [1.6 FPS available]
INFO:talkinghead.tha3.app.app:encode: 31.2ms [32.1 FPS available]; send sync wait 0.0ms
INFO:talkinghead.tha3.app.app:output: 39.9ms [25.1 FPS]; target 40.0ms [25.0 FPS]
INFO:talkinghead.tha3.app.app:render: 605.6ms [1.7 FPS available]

So yeah, now that the plugin has been optimized, it's the inference of the deep learning model. This can't be easily optimized further, so I only recommend live mode on GPU.

(I haven't looked into why the encoder is slower in CPU mode - maybe the renderer is competing for the same resources. Doesn't matter in the grand scheme of things, though. In GPU mode, the encoder runs fine, and in CPU mode, the encoder is not the bottleneck.)

Before PRing this in, I'd like to add the client-side configurability (because we have a postprocessor now and it doesn't make sense to have it always on), but I'll actually start by fixing some bugs. For details, see the TODO.

@Cohee1207: I suppose I'll just modify the SillyTavern client, too, and send PRs simultaneously to both repos?

Technologicat commented 6 months ago

EDIT: Itemized list in the TODO. Current link.

Now that the talking animation is actually working (see the PR auto-linked above), I think I'll look at the backend next.

Right now it's randomizing the mouth every frame, which at the target 25 FPS looks too fast. Early 2000s anime used ~12 FPS as the fastest actual frame rate of new cels (notwithstanding camera panning effects and similar), which might look better. Also, the mouth should probably be set to its final position as specified by the current emotion as soon as the talking animation ends. I don't recall off the top of my head whether it does that now.
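
A minimal sketch of the idea, not the code that ended up in talkinghead: hold each random mouth value until about 1/12 s has passed, no matter how often frames are rendered. The class name, the `talking_fps` parameter, and the single `value` morph are all hypothetical.

```python
import random


class MouthRandomizer:
    """Re-randomize the mouth at most `talking_fps` times per second."""

    def __init__(self, talking_fps=12.0, rng=None):
        self.interval = 1.0 / talking_fps   # minimum hold time, in seconds
        self.rng = rng or random.Random()
        self.last_update = None             # timestamp of the last re-roll
        self.value = 0.0                    # current mouth-open morph, 0..1

    def update(self, now):
        """Call once per rendered frame; `now` is a timestamp in seconds."""
        if self.last_update is None or now - self.last_update >= self.interval:
            self.value = self.rng.random()
            self.last_update = now
        return self.value


# Render loop at 25 FPS (one frame every 40 ms) for one second.
mouth = MouthRandomizer(talking_fps=12.0)
frames = [mouth.update(i * 0.04) for i in range(25)]
```

With frames rendered every 40 ms, the value here only changes on every third frame, because the hold interval gets quantized to whole frames.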

Then, on an unrelated note, there's still the matter of sending postprocessor configurations from the client all the way to talkinghead so that we can get the configurable postprocessor into the users' hands.

I'm thinking that at least initially, I'll just let the user provide some JSON files (per character, in SillyTavern/public/characters/characternamehere/), and document the available filters and their settings in the talkinghead README.

Also, we could implement per-character emotion templates the same way, as JSON files in the character's folder. More stuff for loadTalkingHead to send, but I suppose that's not an issue.

And, we should fix the bug of the thumbnails in the ST GUI not updating when a new sprite is uploaded in the character expressions settings to replace an old one. Not an issue during normal use, but while testing the code or developing new characters, it would be useful to see the correct state when I quickly upload a different talkinghead.png in the GUI settings.

Then there's some cleanliness work on the backend. The internals of set_emotion could use a better division of responsibilities. Clearly, classify wants to produce a result and then just set_emotion it. So there's a more convenient, smaller function that wants to get out from the one that takes in a dictionary of classification results. I'll fix this in the next backend PR.
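
A sketch of what that split could look like. The `Animator` class is a stand-in, and the shape of the classification results (a list of label/score entries) is an assumption made for this example; only the two function names come from the plan above.

```python
class Animator:
    """Stand-in for the animator; it only tracks the current emotion name."""

    def __init__(self):
        self.emotion = "neutral"

    def set_emotion(self, emotion):
        """Small entry point: apply one named emotion directly."""
        self.emotion = emotion

    def set_emotion_from_classification(self, results):
        """Pick the highest-scoring label, then delegate to set_emotion.

        `results` is assumed to look like
        [{"label": "joy", "score": 0.8}, ...].
        """
        if not results:
            return  # nothing was classified; keep the current emotion
        best = max(results, key=lambda entry: entry["score"])
        self.set_emotion(best["label"])


animator = Animator()
animator.set_emotion_from_classification(
    [{"label": "anger", "score": 0.1}, {"label": "curiosity", "score": 0.7}]
)
print(animator.emotion)
```

This way a caller that already knows the emotion name (such as an /emote command) can use the small function, and only the classifier path needs the larger one.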

Technologicat commented 6 months ago

Heads-up: upcoming changes in talkinghead-next3:

Improve talking animation, looks at least serviceable for now
- There's still a small delay at the client end, after the AI starts writing and before the animation starts, but I haven't yet figured out the cause.
Code cleanup
- Refactor model installer
- Division of responsibilities: set_emotion, set_emotion_from_classification
- Use consistent API endpoint naming scheme for the server functions implementing the endpoints
- Add some docstrings
- autopep8 server.py, and also manually make the code into more idiomatic Python (no more flake8 warnings)
- This is isolated to commit 6633535, in case it's not desirable.

I still have some animation reliability work to do that I want to include before PR'ing this set of backend changes in. Specifically, I'll look into decoupling the animation rate from the render framerate, to make the result look better when the rendering is slow (choppy animation would be better than a slow-motion crawl).
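
To illustrate what decoupling means here, a sketch under simplifying assumptions: the animation advances by elapsed wall time instead of by a fixed amount per frame. It uses plain linear motion and hypothetical names; the scheme actually used in talkinghead may differ.

```python
def step_toward(current, target, speed_per_second, dt):
    """Move `current` toward `target` by at most speed_per_second * dt."""
    max_step = speed_per_second * dt
    delta = target - current
    if abs(delta) <= max_step:
        return target
    return current + max_step if delta > 0 else current - max_step


def simulate(render_fps, duration=1.0, speed_per_second=0.5):
    """Animate a morph from 0.0 toward 1.0 for `duration` seconds."""
    dt = 1.0 / render_fps  # wall time between two rendered frames
    value = 0.0
    for _ in range(round(duration * render_fps)):
        value = step_toward(value, 1.0, speed_per_second, dt)
    return value


# Fast and slow renderers reach (nearly) the same state after one second;
# the slow one just gets there in fewer, larger steps.
print(simulate(render_fps=25), simulate(render_fps=5))
```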

Technologicat commented 6 months ago

As of https://github.com/Technologicat/SillyTavern-Extras/commit/11ff18e52e001b45f718af2a3c4ac42fbd4b2b80, animation speed decoupled from render FPS.

EDIT: This and the talking animation changes are now posted in the following PRs: https://github.com/SillyTavern/SillyTavern/pull/1656 (frontend), https://github.com/SillyTavern/SillyTavern-Extras/pull/214 (backend). EDIT: Both merged as of 9 January, 2024.

Remaining TODOs for near future:

~Per-character JSON configuration (all optional, with server-side defaults):~ EDIT: Done.
- ~Animation parameters.~ EDIT: Done.
- ~Emotion templates.~ EDIT: Done.
- ~Postprocessor settings.~ EDIT: Done.
Remaining minor bugs / missing features:
- When a new sprite is uploaded in the ST client GUI settings, update the thumbnail.
- When switching chats, if classify is enabled, update the current character's emotion state from the AI's latest message.
Update the documentation (README, user manual).
- Add some screenshots.
- ~Explain how to use the config system (once we implement it).~ EDIT: Implemented and explained in the talkinghead README.

After these, I'll likely declare talkinghead feature-complete for now.

Technologicat commented 6 months ago

As of https://github.com/SillyTavern/SillyTavern-Extras/pull/216 and https://github.com/SillyTavern/SillyTavern/pull/1683, per-character configurability has been implemented (no GUI, just JSON files at least for now), and how to use it is explained in the talkinghead README.

EDIT: Both PRs have been merged as of 15 January, 2024.

biship commented 6 months ago

Fantastic effort. Thank you. I can't wait to see it.

Technologicat commented 6 months ago

Merged as of this Monday :)

After you pull the latest SillyTavern-extras from git, instructions for configuring talkinghead (and some examples) can be found in SillyTavern-extras/talkinghead/README.md. I haven't added screenshots yet, though.

Note that to enable some features, you'll need to update your SillyTavern frontend, too. These features include the postprocessor, the talking animation (while the LLM is streaming text), and /emote support. The frontend changes have been merged into the staging branch of SillyTavern.

To check that the software works, you can use the example character. Just copy SillyTavern-extras/talkinghead/tha3/images/example.png to SillyTavern/public/characters/yourcharacternamehere/talkinghead.png. (And then follow the instructions in the README to enable and configure talkinghead.)

Also, obviously, to go beyond testing with the example character, you'll need to make a talkinghead.png for your AI character. If you use Stable Diffusion, there are some tips in the README. Expect to render 100+ gens to get one suitable result, and then to spend some time in an image editor (GIMP, Photoshop, or similar) cutting the character cleanly from the background. For reference, I finished my test character in maybe an hour and a half total, of which 20 minutes was spent in GIMP.

On the backend side, the next upcoming dev branch is talkinghead-next5@Technologicat. No changes since Monday's merge yet, aside from some TODO updates.

The up-to-date TODO list can be found here.

As usual, no promises which items I'll ever get around to doing. The most likely scenario is, I'll fix some bugs, polish up the documentation to a state worthy of an actual version-numbered release, and then switch to other projects for a while. I originally intended to hack on talkinghead for a week or so, but it's been a month already. :)

biship commented 6 months ago

@Technologicat I have the latest staging ST & latest STE, but do not have this option:

To enable talkinghead mode in Character Expressions, check the checkbox Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras).

I have this:

Character Expressions
- Local server classification
- Show default images (emojis) if sprite missing
- Custom Expressions: Can be set manually or with an /emote slash command. [ No custom expressions ]
- Sprite Folder Override: Use a forward slash to specify a subfolder. Example: Bob/formal. Override folder name
- Sprite set: Eve

Cohee1207 commented 6 months ago

Talkinghead can't be used with local classification. If you have it enabled, talkinghead is hidden and disabled.

biship commented 6 months ago

Ok, I had to uncheck Local server classification. I also had to restart both ST & STE. Now, how do I move the sprite?

Technologicat commented 6 months ago

@biship: Yeah, it sometimes gets finicky and likes to be restarted. But not very often, so I haven't found out the cause.

To check that different expressions work, you can use /emote xxx, where xxx is the name of one of the 28 emotions. See e.g. the filenames of the emotion templates in SillyTavern-extras/talkinghead/emotions.

I think the Character Expressions control panel also has a full list of emotions. In fact, instead of using the /emote xxx command, clicking one of the sprite slots in that control panel should apply that expression to the character. But I'm not sure at the moment if this second way works if the character doesn't have static expression sprites.

To make the character change expression automatically based on the AI character's current reply, enable classification. Not the local one, but the one served by SillyTavern-extras. Be sure to enable the classify module in your SillyTavern-extras.

For example, my SillyTavern-extras config is:

--enable-modules=classify,talkinghead,summarize,websearch --classification-model=joeddav/distilbert-base-uncased-go-emotions-student  --summarization-model=philschmid/bart-large-cnn-samsum --talkinghead-gpu

Pruning the unrelated stuff, this should work:

--enable-modules=classify,talkinghead --classification-model=joeddav/distilbert-base-uncased-go-emotions-student --talkinghead-gpu

As for positioning the sprite on the screen, the position is currently static. Due to the base pose used by the posing engine THA3, the character's legs are always cut off at the bottom of the image, so the sprite needs to be placed at the bottom.

By the way, thanks for testing - this interaction is great for finding out what I've forgotten to include in the docs. :)

Technologicat commented 6 months ago

> Talkinghead can't be used with local classification. If you have it enabled, talkinghead is hidden and disabled.

@Cohee1207: To think of it, is there a reason for that, other than that previous versions of talkinghead didn't have a set_emotion API endpoint? I might be able to fix this.

Technologicat commented 6 months ago

@Cohee1207, @biship: talkinghead README updated based on the latest points raised here. Thanks to both of you!

biship commented 6 months ago

> @Cohee1207, @biship: talkinghead README updated based on the latest points raised here. Thanks to both of you!

Ah, yes the new instructions are more helpful, thanks.

You could mention that if you enable "Moving UI", you can move the image. Also, unless you have "Visual Novel Mode" enabled, you can't see the image, as it's behind the chat window. Also, you can't move it all the way to the left: there is a good 300(ish?) pixels on the side of the image that prevents it from actually touching the left edge of the screen.

Technologicat commented 6 months ago

@Cohee1207: Thanks. I wasn't aware of what "Moving UI" did (or how to use it).

At least in my installation, talkinghead appears at the left side, at the bottom, beside the actual chat window? (Tested both at 4k and at 1080p, same behavior.)

Yes, the blank background is part of the live feed itself, which is silly, but the engine is what it is. I suppose 512x512 is just a de facto standard size for AI image processing input these days.

I could add a crop filter...

Technologicat commented 6 months ago

@Cohee1207: One more thing: technically it's in the TODO, but I know you're busy, so I'll mention it here: I aim to update the user manual for talkinghead.

I think we should de-emphasize AITubing/VTubing, given the different aim of the software (animating AI character avatars, not user avatars). The new README accounts for this already.

I still have to combine any relevant information from the old user manual, and add some screenshots.

I'll finish the README first. Once done, we can then see if it should be moved to replace the old user manual.

Technologicat commented 6 months ago

And just to avoid any surprises, mentioning this too: most of the postprocessing filters now have automatic FPS correction. The postprocessor also reports its average elapsed time per frame (extras log, info level). Note that as of this writing, the postproc time is also included in the reported render time. This might still change before the next release.

The VHS glitches filter is still missing FPS correction. I'll try to fix that tomorrow (or in the next few days).

EDIT: Ah, and fixed a minor race condition in the postprocessor when replacing the filter chain on the fly.

Technologicat commented 5 months ago

@Cohee1207: I quickly added a simple crop filter to the backend, now available in talkinghead-next5, and documented in the README.

However, there seems to be some logic at the frontend side that reserves a square shape for the talkinghead sprite output, regardless of the image dimensions or aspect ratio of the actual result_feed.

The crop does already make the postprocessor's job lighter, since it doesn't have to handle that much empty space.

I'll need to take a closer look at the frontend...

In other news, all postprocessor filters are now framerate-independent, you can have multiple filters of the same kind in the postprocessor chain, and the TODO has been updated to more clearly indicate priorities.
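
To show why an ordered chain allows repeats, here is a toy sketch. The chain is a list of (filter name, settings) pairs applied in order, so the same filter can appear more than once, which a mapping keyed by filter name could not express. The two filters and the flat list of pixel values are invented for this example and are not talkinghead's real filters.

```python
def brighten(pixels, gain=1.0):
    """Toy filter: scale each value, clipping at 1.0."""
    return [min(1.0, p * gain) for p in pixels]


def invert(pixels):
    """Toy filter: flip each value within the 0..1 range."""
    return [1.0 - p for p in pixels]


FILTERS = {"brighten": brighten, "invert": invert}


def apply_chain(pixels, chain):
    """Apply each (name, settings) pair in order."""
    for name, settings in chain:
        pixels = FILTERS[name](pixels, **settings)
    return pixels


# "brighten" appears twice, with different settings each time.
chain = [
    ("brighten", {"gain": 1.5}),
    ("invert", {}),
    ("brighten", {"gain": 2.0}),
]
result = apply_chain([0.2, 0.8], chain)
```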

Technologicat commented 5 months ago

Added server-side animator and postprocessor settings to talkinghead-next5.

Loaded from SillyTavern-extras/talkinghead/animator.json. Can provide the same settings as in SillyTavern/public/characters/yourcharacternamehere/_animator.json, except that these act as customizable server-side defaults.

Three-level system:

User-provided settings have the highest priority (from the optional client-side per-character config)
Then server-side defaults (from the new, optional server-side config)
Then the built-in hardcoded defaults
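
The priority order above can be sketched as a layered dictionary merge, where later layers override earlier ones. The setting names below are placeholders chosen for illustration, and `resolve_settings` is a hypothetical helper, not the function used in the backend.

```python
# Placeholder keys; the real setting names are documented in the README.
BUILTIN_DEFAULTS = {"target_fps": 25, "blink_interval": 4.0, "filters": []}


def resolve_settings(builtin, server=None, client=None):
    """Merge the three levels; later updates win over earlier ones."""
    merged = dict(builtin)        # lowest priority: hardcoded defaults
    merged.update(server or {})   # optional server-side animator.json
    merged.update(client or {})   # optional per-character _animator.json
    return merged


settings = resolve_settings(
    BUILTIN_DEFAULTS,
    server={"target_fps": 20, "blink_interval": 6.0},
    client={"target_fps": 30},
)
print(settings)
```

Both config layers are optional, so passing `None` for either one simply leaves the lower-priority values in place.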

~Still need to update the docs.~ EDIT ...aaaaand updated. Now described in the talkinghead README.

Technologicat commented 5 months ago

Today's update (up to commit 6193074):

Postprocessor: brightness filters no longer affect translucency.
- Implemented via RGB<->YUV conversion.
- New lumanoise filter, sister to alphanoise.
- scanlines now has a channel parameter (can be "A" for alpha or "Y" for luminance, latter is default).
From now on, the reported render time excludes postprocessing time.
- Total render time per frame = reported render time + reported postproc time.
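
To illustrate the RGB<->YUV approach for a single pixel: convert to YUV, change only the luminance Y, convert back, and pass alpha through untouched. This is a sketch that assumes the classic BT.601 analog YUV coefficients and float channels in 0..1; which YUV variant talkinghead actually uses is not stated here, and the function names are made up.

```python
def rgb_to_yuv(r, g, b):
    """BT.601 analog YUV (an assumption for this sketch)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Exact inverse of rgb_to_yuv above."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def clamp01(x):
    """Clip a channel value to the 0..1 range."""
    return min(1.0, max(0.0, x))


def scale_brightness(rgba, gain):
    """Scale luminance only; chroma (U, V) and alpha stay as they were."""
    r, g, b, a = rgba
    y, u, v = rgb_to_yuv(r, g, b)
    r2, g2, b2 = yuv_to_rgb(y * gain, u, v)
    return clamp01(r2), clamp01(g2), clamp01(b2), a


# A half-transparent gray pixel gets brighter, but no more or less opaque.
print(scale_brightness((0.5, 0.5, 0.5, 0.3), gain=1.2))
```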

EDIT: Backend nearing completion for now. Posted the PR, https://github.com/SillyTavern/SillyTavern-Extras/pull/219. EDIT: Merged as of 21 January 2024.

Technologicat commented 5 months ago

~Development now in talkinghead-next6@Technologicat.~

Added analog_distort and shift_distort filters to simulate some types of bad video transports.

Next priority areas are some frontend fixes (not affecting extras, just the main ST), and polishing the documentation for release (which does affect extras).

~See updated talkinghead TODO.~

EDIT: PR created for these postproc filters: https://github.com/SillyTavern/SillyTavern-Extras/pull/221 [a.k.a. talkinghead-next6] EDIT: Merged as of 25 January 2024. EDIT: No new changes yet as of 1 February 2024, so the up to date TODO list for Talkinghead is now in the main repo, see talkinghead/TODO.md. Edited the first post at the top of this thread to clarify this.

Technologicat commented 5 months ago

Note to self: playAudioData in SillyTavern/public/scripts/extensions/tts/index.js already does lip sync with VRM (VRoid), so that's where we should inject it if we want to do that with talkinghead.

Investigate this properly (possibly much) later. THA3 has the morphs already; the remaining issue is to extract the phoneme from the TTS audio - currently I have no idea how VRM does it.

But to even develop this functionality, I'd first need to get a TTS setup working.

EDIT: Clarification: TTS is currently not a priority for me, and likely won't be in the near future, for various reasons.

Speech is hard. Traditional TTS systems have serious difficulty getting intonation right and/or matching the tone of voice to the sentiment of the text (or do not even try), making them fall head first into the uncanny valley.
Modern AI-based TTS looks promising, but takes way too much VRAM to run it simultaneously with an LLM on a laptop. Also, sentiment extraction might not yet work well enough to get the tone right.
Writing and speech are almost completely different modes of communication. In my experience, TTS works with prose, but not with much else. Things common in the written sphere such as equations, code, numbers, acronyms, bullet point lists, and parenthetical remarks all present a problem for rendering an arbitrary text into natural-sounding speech.
Text naturally provides random access and variable-speed access. At least for me, this increases throughput while decreasing (human) working memory requirements. This makes the written word preferable over speech for certain use cases, which happen to contain the cases I currently use LLMs for.

So, while I find TTS lip-syncing an interesting technical problem and might look into it later, for practical use I don't need it right now.

Technologicat commented 5 months ago

Some small client-side Talkinghead issues fixed in: https://github.com/SillyTavern/SillyTavern/pull/1790

Merged as of 6 February 2024.

EDIT: Specifically, the fixes are:

Auto-pause Talkinghead when ST tab is hidden to save GPU resources.
Add /th (alias /talkinghead) slash command to toggle Talkinghead mode in Character Expressions.
- This is convenient as a Quick Reply button, to save GPU resources (especially on a laptop) if you know you'll be AFK for a while.
When switching talkinghead off, set the correct character expression.
- Use the fallback expression only if the character's last expression can't be determined.
Consistency: When checking if talkinghead is enabled, always check also whether the Extras module is enabled.
Fix bug: Refresh the talkinghead character also on expression zip upload.