SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.
GNU Affero General Public License v3.0

talkinghead changes TODO list #206

Open Technologicat opened 6 months ago

Technologicat commented 6 months ago

EDIT: This has become more of a temporary devblog and less of a TODO list.

:exclamation: Development can move fast. Old posts are old. :exclamation: See the latest posts below for what is currently going on.

Talkinghead TODOs, as of the latest merged PR at any given moment, can be found in talkinghead/ in the main SillyTavern-extras repo.

EDIT: :exclamation: The rest of this post is old, preserved for archival purposes only. :exclamation:

This is primarily for myself to keep track of what I'm doing, as well as to record any leftover ideas that are not so likely to get done.


Technologicat commented 6 months ago

Yup, wxPython is not needed by Got rid of it, and: 30.3

This got rid of the crash on exit, too. It was likely because wx.App was running in a background thread, which is a no-no.

There's still a lot to fix. It's still eating a lot of resources, but at least it's doing something useful with them.

We'll need better idle animations, too, now that the framerate is improved.

PR upcoming eventually. In the meantime, preview here:

Technologicat commented 6 months ago

@Cohee1207: One specific question: you mentioned that you find talkinghead uncanny. Was it because of the low framerate, the quality of the idle animations, or the image quality of the AI interpolation? Trying to evaluate if I could do something.

The framerate is now at least somewhat fixed, and I might have some ideas to improve the idle animations, but the quality of the AI interpolation is what it is as long as we're running on the THA3 models. Hardware is also what it is, so I think this is the right model size right now.

But judging by the pictures in the tech reports, THA3 should be capable of rather impressive quality, given the right input. With regard to this part, how to produce a suitable input easily with SD/GIMP is the open question.

Cohee1207 commented 6 months ago

I don't find the motion produced to look pleasant or enjoyable to the degree of just a good static image. Maybe I'm biased or influenced by the publicity of some well-known "AI animations" that have a similar vibe to them (for reference).

Technologicat commented 6 months ago

Thanks for your input!

Personally, I don't see anything wrong with the animation in the link you provided, except that in animation, my own preference is 2D. Used to be an anime fan back in the day. 3D CG has always looked wrong to me. I suppose it's a matter of taste.

The specific question was because at the higher framerate, the current idle animations of talkinghead look off to me. At least now we have some speed to do something interesting with. I'll look into it.

Technologicat commented 6 months ago

As of 4a25a1e, the remnants of the IFacialMocap stuff are gone from the code. Animation logic rewritten, for great justice.

Next to clean up the repo.

And to figure out what I borked in my local git: it's telling me that origin/appfixes is not a branch (and fails to update the corresponding head), although the push to GitHub works fine.

Technologicat commented 6 months ago

Repo cleaned up. The plugin is now ~400 lines, and the code that remains looks much cleaner. :)

Technologicat commented 6 months ago

PR posted, see #207.


Planned next:


Technologicat commented 6 months ago

Implemented a framerate limiter in talkinghead-next@Technologicat.

This branch also includes a command-line flag to choose the talkinghead compute model.

Result:

render FPS: 46.0
network FPS: 23.9
render FPS: 46.7
network FPS: 23.9
render FPS: 46.6
network FPS: 24.0
render FPS: 46.8
network FPS: 24.0
render FPS: 46.1
network FPS: 23.8
render FPS: 45.5
network FPS: 23.4

The available render FPS measures how fast the animator can run on the current hardware. This is using the separable_half model (i.e. float16). Same machine as before, with an RTX 3070 Ti mobile.

The rate-limited network FPS measures the actual time between network sends, after applying the framerate limiter.

The code limits the network FPS to a hard-coded 25 (0.04 seconds wait per frame). Due to the simplistic way that I currently calculate the wait time, this sets an upper bound that's never reached exactly.

I tried more sophisticated ways to calculate the wait time, but they turned out brittle, and didn't improve the result. So I think the simplistic version is the best - ~24 FPS is just fine.

We now render only as many frames as the client consumes, so as long as the render FPS > network FPS, this will save GPU compute resources compared to the previous versions.
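As a rough illustration of why a simple limiter tops out just under its target, here is a minimal sketch. It is not the actual talkinghead code: `wait_time`, `send_loop` and the `send_frame` callback are hypothetical names, and the only figure taken from the post is the 25 FPS (0.04 s) target. The sleep only subtracts the time spent sending, so loop overhead and sleep overshoot are never compensated.

```python
import time

TARGET_FPS = 25                 # hard-coded target mentioned in the post
TARGET_SEC = 1.0 / TARGET_FPS   # 0.04 s per frame


def wait_time(elapsed_sec, target_sec=TARGET_SEC):
    """Seconds to sleep after work that took `elapsed_sec`; never negative."""
    return max(0.0, target_sec - elapsed_sec)


def send_loop(send_frame, n_frames):
    """Call `send_frame` n_frames times with a simple limiter; return measured FPS.

    `send_frame` is a hypothetical callback that pushes one frame to the client.
    """
    start = time.perf_counter()
    for _ in range(n_frames):
        t0 = time.perf_counter()
        send_frame()
        # Sleep for whatever is left of the frame budget. Overshoot in sleep()
        # and the loop's own overhead are ignored, so the result lands slightly
        # below TARGET_FPS rather than exactly on it.
        time.sleep(wait_time(time.perf_counter() - t0))
    return n_frames / (time.perf_counter() - start)
```

Calling `send_loop(lambda: None, 50)` should report a value a little under 25, which matches the kind of ~24 FPS readings shown in the log.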

Next up:

Now that we always run near 24 FPS given enough GPU compute, I'll leave the timestep implementation as-is, unless there is interest in supporting lower-spec hardware that can't reach that 24 FPS.

So the next step is improving the idle animations.

EDIT: But there's also CPU mode, which runs at ~2 FPS on an i7-12700H, but may be useful for testing. Fixed the framerate limiter to work correctly also when render FPS < network FPS (in that case, the latest rendered frame is re-sent until a new one becomes available). But the animation logic needs to account for this.

Technologicat commented 6 months ago

In talkinghead-next, added:

I already have a plan how to improve the sway animation. Stay tuned...

Technologicat commented 6 months ago

Um... turns out that while testing, I had accidentally underclocked my GPU to 1100 MHz (from 1700 MHz).

I mean, I do that on purpose to reduce fan noise, but the underclock wasn't supposed to be active during the performance test. I meant to run at factory settings, to give a better idea of performance on a stock RTX 3070 Ti mobile GPU chip (for reference, 125W TDP; there are various laptop brands/models with the same GPU, but different GPU TDP).

So, rerunning the test at full clock rate. Result:

render FPS: 62.5
network FPS: 24.2
render FPS: 63.1
network FPS: 24.2
render FPS: 63.4
network FPS: 24.1
render FPS: 60.0
network FPS: 24.1

Compare with the GPU underclocked:

render FPS: 47.1
network FPS: 24.1
render FPS: 45.2
network FPS: 23.8
render FPS: 47.2
network FPS: 24.2
render FPS: 46.7
network FPS: 24.3

So it turns out this thing can render at 60 FPS with the separable_half model. GPU power draw is then near 80 W, though.

I can only speculate how fast it would run (and how much power it would draw) on a desktop GPU.

Technologicat commented 6 months ago

In talkinghead-next, added:

Technologicat commented 6 months ago

talkinghead-next has been PR'd, see #209.

Technologicat commented 6 months ago

One more TODO that has not been mentioned here yet:

Also, to investigate:

Technologicat commented 6 months ago

It's still experimental, but here, a small xmas present to the open source community.

talkinghead-nextnext now renders your live AI avatar as a lo-fi scifi hologram:


Yes, she's translucent:


The scanlines and noise are dynamic, and the bloom (fake HDR) imitates the look of early 2000s anime.

GPU powered, as usual. This consists essentially of a few small fragment shaders written in Torch. :P

Man, I love open source.

TODO: clean up the code, make this configurable, and see if we can improve performance (at full GPU power, 48 FPS render, 18 FPS network send; underclocked, 39 FPS render, 18 FPS network send).

Cohee1207 commented 6 months ago

Thanks, that looks cool. But I'm not sure many people frequently monitor these issues. To get a wider audience for this, do a post on resources like Reddit.

Katehuuh commented 6 months ago

It would be great to have it all in one with talkinghead+live2d, or at least to call some features from live2d to make it more compatible.

Technologicat commented 6 months ago

@Cohee1207 : Good point. OTOH, there's still much coding to do before this is ready for prime time. For current thoughts, see the new TODO.

Also, I feel I'm not the right person to spend much time engaging with an audience. I could post my thoughts on a devblog or something, but regularly responding to reader comments is probably too much.

@Katehuuh : Compatibility is a nice long-term goal, but it's not in the immediate future.

Frankly, the first I heard of Live2D specifically was when I happened to run into talkinghead. I was previously aware that VTubing is a thing, but that was about it. Then along comes this piece of tech that can animate an anime character on a GPU in realtime, based on an AI model... I don't really have a handle on what features anyone but me expects. :)

@ everyone reading this:

Personally, I have an artistic and technical vision as to where I want to take this. I'm doing this for two reasons: 1) Cover my own use case of making an AI assistant character feel less "cold" to interact with, and 2) Give back to the SillyTavern community by contributing potentially useful changes.

I think ST is, at this time, the most comprehensive platform for my use case. It has a vector store that can ingest PDFs (for research use), its hardware requirements are tolerable for a laptop user, and it has a unique focus on making the AI into a character to interact with (with all the features that work toward that goal).

As for the THA3 posing engine, the fact that it works in anime style, specifically, is a major bonus for me.

New dev branch, talkinghead-next2@Technologicat. Rebased talkinghead-nextnext on the latest upstream main; new development will happen in this branch (talkinghead-next2). I will eventually delete any outdated branches.

Changelog after PR #209:

Technologicat commented 6 months ago

As of, I think that's enough overdoing the postprocessing filters for now. Next up, refactoring the postprocessor that currently takes fully one half of, and making it configurable.

EDIT: And as of, the postprocessor now lives in talkinghead/tha3/app/, and has the ability to take a configuration. Format documented there, designed for easy JSONability. Now we just need a client end to manage and feed in such configurations.

@Cohee1207: I'll soon need to expose some configuration options for talkinghead, including things such as idle animation parameters (how fidgety the character is, their breathing rate, etc.) and postprocessor settings (e.g. some characters could be scifi holograms, some could look like a badly calibrated 1980s VHS tape, while most would be normal).

This needs a very small number of string/bool/int/float options that could be stored as JSON. In the long term, per-character settings storage would be preferable. Also, I really want the settings to be modifiable live, to allow interactive experimentation with the live character's look and feel.

So a question: What is the preferred way to do this?

For example, should I also modify the main SillyTavern code, adding a new configuration panel next to Character Expressions in the client, make the server save the settings under public/characters/charactername/talkinghead.json, and make the character expressions system send those settings via a new API endpoint in SillyTavern-extras?

I can handle the extras side easily, but I haven't yet looked at the main ST code. For JavaScript, I'll need some code examples to get going, but I suppose I can get those by looking at the code of the existing config panels and at how the system currently interacts with talkinghead.

EDIT: This is essentially what I know about JS (I wrote that in early 2020, when working on a full-stack project with a Python backend).

Technologicat commented 6 months ago

Update: As of, frame timing is good now.

Also, PNG is fine as transport if we drop to the fastest compression_level=1 instead of the default, tighter 6. Even at 1, the network send itself still takes under 0.15ms per frame. On my i7-12700H, a PNG encode completes in 20ms at 1 instead of 40ms at 6 (and, out of curiosity, 120ms at the maximum setting 9).
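PNG compresses its pixel data with DEFLATE, so the level discussed above is the usual zlib 1..9 speed/size knob. The sketch below times raw `zlib.compress` at levels 1, 6 and 9 on a synthetic buffer. It is only meant to show the shape of the trade-off: the buffer is a made-up gradient standing in for one 512x512 RGBA frame, `time_deflate` is a hypothetical helper, and the absolute timings will not match the PNG encode figures quoted above (a real PNG encoder also filters scanlines first).

```python
import time
import zlib


def time_deflate(raw, level):
    """Compress `raw` at the given zlib level; return (compressed_size, seconds)."""
    t0 = time.perf_counter()
    compressed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - t0
    assert zlib.decompress(compressed) == raw  # lossless at every level
    return len(compressed), elapsed


# Stand-in for one 512x512 RGBA frame (a smooth gradient, not real avatar data).
frame = bytes((x + y) % 256 for y in range(512) for x in range(512 * 4))

for level in (1, 6, 9):
    size, sec = time_deflate(frame, level)
    print(f"level {level}: {size} bytes in {sec * 1000:.1f} ms")
```

On real frames, the interesting comparison is whether the extra bytes at level 1 cost less time on the wire than the encoder saves, which is what the measurements above suggest.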

The system now uses three threads. Regardless of the global interpreter lock, in my tests this improves throughput. In general, while frame N is being sent, frame N+1 is being encoded, and frame N+2 is being rendered.

At most as many frames are rendered as are actually sent. Each new frame is encoded only once. The network output is isolated from any hiccups in render and/or encode. If a new frame is not available, it re-sends the latest available one.

Example on the RTX 3070 Ti mobile, underclocked to 1100 MHz to reduce fan noise. This is with some postproc filters enabled (specifically: bloom, chromatic aberration, vignetting, translucency, alpha noise, banding, scanlines):

24.7ms [40.4 FPS available]
40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
22.9ms [43.7 FPS available]; send sync wait 6.9ms
24.8ms [40.3 FPS available]
40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
23.2ms [43.1 FPS available]; send sync wait 6.6ms
24.7ms [40.5 FPS available]
40.0ms [25.0 FPS]; target 40.0ms [25.0 FPS]
23.2ms [43.1 FPS available]; send sync wait 5.7ms

In this example, although a render+encode combo would take ~48ms if run serially, it actually completes in ~34ms, as is seen from the ~6ms spent in "send sync wait". This means that the encoder has an encoded frame ready, but is waiting for the previous encoded frame to be consumed (sent over the network) before updating its output. At that time, the render for the next frame is already in progress; it starts in parallel as soon as the encoder starts encoding the current one.
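The handoff between the encoder and the network sender described above can be pictured as a one-frame mailbox. The following is a minimal sketch under my own assumptions, not the plugin's actual code: `FrameSlot`, `publish` and `take` are hypothetical names. The encoder blocks in `publish` until the previous frame has been consumed (the "send sync wait"), while the sender never blocks and simply gets the latest frame again if nothing new has arrived.

```python
import threading


class FrameSlot:
    """One-frame mailbox between an encoder thread and a sender thread."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None    # latest encoded frame (bytes), if any
        self._fresh = False   # True while the frame has not been sent yet

    def publish(self, frame):
        """Encoder side: wait until the previous frame was consumed, then replace it."""
        with self._cond:
            while self._fresh:
                self._cond.wait()   # this wait is the "send sync wait"
            self._frame = frame
            self._fresh = True

    def take(self):
        """Sender side: return (frame, is_new) without blocking.

        If no new frame has been published, the latest one is returned again
        with is_new=False, so the network output keeps its own pace.
        """
        with self._cond:
            is_new = self._fresh
            self._fresh = False
            self._cond.notify_all()  # wake the encoder if it is waiting
            return self._frame, is_new
```

In such a design the renderer would feed the encoder through a second slot of the same kind, which is what lets frame N+2 render while frame N+1 encodes and frame N is on the wire.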

The three-part division of responsibilities also makes it obvious which part is the slow one in CPU mode:

32.0ms [31.2 FPS available]; send sync wait 0.0ms
39.9ms [25.0 FPS]; target 40.0ms [25.0 FPS]
611.4ms [1.6 FPS available]
32.1ms [31.2 FPS available]; send sync wait 0.0ms
39.9ms [25.1 FPS]; target 40.0ms [25.0 FPS]
607.3ms [1.6 FPS available]
31.2ms [32.1 FPS available]; send sync wait 0.0ms
39.9ms [25.1 FPS]; target 40.0ms [25.0 FPS]
605.6ms [1.7 FPS available]

So yeah, now that the plugin has been optimized, it's the inference of the deep learning model. This can't be easily optimized further, so I only recommend live mode on GPU.

(I haven't looked into why the encoder is slower in CPU mode - maybe the renderer is competing for the same resources. Doesn't matter in the grand scheme of things, though. In GPU mode, the encoder runs fine, and in CPU mode, the encoder is not the bottleneck.)

Before PRing this in, I'd like to add the client-side configurability (because we have a postprocessor now and it doesn't make sense to have it always on), but I'll actually start by fixing some bugs. For details, see the TODO.

@Cohee1207: I suppose I'll just modify the SillyTavern client, too, and send PRs simultaneously to both repos?

Technologicat commented 6 months ago

EDIT: Itemized list in the TODO. Current link.

Now that the talking animation is actually working (see the PR auto-linked above), I think I'll look at the backend next.

Right now it's randomizing the mouth every frame, which at the target 25 FPS looks too fast. Early 2000s anime used ~12 FPS as the fastest actual frame rate of new cels (notwithstanding camera panning effects and similar), which might look better. Also, the mouth should probably be set to its final position as specified by the current emotion as soon as the talking animation ends. I don't recall off the top of my head whether it does that now.
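One possible way to slow the mouth down without touching the render rate is to hold each random mouth value for a fixed time instead of re-rolling it every frame. The sketch below assumes a hypothetical `TalkingMouth` helper driven by timestamps in seconds; the ~12 FPS update rate comes from the paragraph above, everything else (names, the 0..1 value range) is illustrative.

```python
import random


class TalkingMouth:
    """Re-randomize the mouth at a fixed rate, independent of the render rate."""

    def __init__(self, update_fps=12.0, seed=None):
        self.hold_sec = 1.0 / update_fps   # how long each mouth value is held
        self.rng = random.Random(seed)
        self.value = 0.0                   # current mouth openness, 0..1
        self.last_change = None            # timestamp of the last re-roll

    def update(self, now_sec):
        """Call once per rendered frame; returns the mouth value to use."""
        if self.last_change is None or now_sec - self.last_change >= self.hold_sec:
            self.value = self.rng.random()
            self.last_change = now_sec
        return self.value

    def stop(self, final_value):
        """Talking ended: snap to the position the current emotion specifies."""
        self.value = final_value
        self.last_change = None
        return self.value
```

At 25 FPS rendering (a frame every 0.04 s), this changes the mouth roughly every second or third frame, and `stop` covers the point about returning to the emotion's final mouth position.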

Then, on an unrelated note, there's still the matter of sending postprocessor configurations from the client all the way to talkinghead so that we can get the configurable postprocessor into the users' hands.

I'm thinking that at least initially, I'll just let the user provide some JSON files (per character, in SillyTavern/public/characters/characternamehere/), and document the available filters and their settings in the talkinghead README.

Also, we could implement per-character emotion templates the same way, as JSON files in the character's folder. More stuff for loadTalkingHead to send, but I suppose that's not an issue.

And, we should fix the bug of the thumbnails in the ST GUI not updating when a new sprite is uploaded in the character expressions settings to replace an old one. Not an issue during normal use, but while testing the code, and during development of new characters, it would be useful to see the correct state if I quickly upload a different talkinghead.png in the GUI settings.

Then there's some cleanliness work on the backend. The internals of set_emotion could use a better division of responsibilities. Clearly, classify wants to produce a result and then just set_emotion it. So there's a more convenient, smaller function that wants to get out from the one that takes in a dictionary of classification results. I'll fix this in the next backend PR.
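The split described above might look something like the following sketch. The names and the data shape are assumptions made for illustration: here the classification result is taken to be a dict mapping emotion label to score, and `Animator` is a stand-in class, not the real backend object.

```python
def top_emotion(classification):
    """Return the label with the highest score.

    `classification` is assumed to map emotion label -> score,
    e.g. {"joy": 0.8, "neutral": 0.2}.
    """
    return max(classification, key=classification.get)


class Animator:
    """Stand-in for whatever object owns the character's current emotion."""

    def __init__(self):
        self.emotion = "neutral"

    def set_emotion(self, name):
        """Small entry point: apply one named emotion directly (e.g. from /emote)."""
        self.emotion = name

    def set_emotion_from_classification(self, classification):
        """Convenience wrapper for the classifier: pick the winner, then apply it."""
        self.set_emotion(top_emotion(classification))
```

With this division, anything that already knows the emotion name calls the small function, and only the classifier path deals with the results dictionary.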

Technologicat commented 6 months ago

Heads-up: upcoming changes in talkinghead-next3:

I still have some animation reliability work to do that I want to include before PR'ing this set of backend changes in. Specifically, I'll look into decoupling the animation rate from the render framerate, to make the result look better when the rendering is slow (choppy animation would be better than a slow-motion crawl).

Technologicat commented 6 months ago

As of, animation speed decoupled from render FPS.
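A common way to decouple animation speed from the render rate is to scale each interpolation step by the actual time elapsed since the previous frame. The sketch below shows that general technique, not necessarily what the commit does: `interpolation_alpha` and `animate` are hypothetical names, and `step` is the fraction of the remaining distance covered per frame at a 25 FPS reference rate.

```python
def interpolation_alpha(step, dt_sec, reference_fps=25.0):
    """Fraction of the remaining distance to cover in a frame lasting `dt_sec`.

    `step` is the per-frame fraction at `reference_fps`. After n reference
    frames the remaining fraction is (1 - step)**n, so a frame that lasts
    n = dt_sec * reference_fps reference frames covers 1 - (1 - step)**n.
    """
    n = dt_sec * reference_fps
    return 1.0 - (1.0 - step) ** n


def animate(current, target, step, dt_sec, reference_fps=25.0):
    """Move `current` toward `target` by a time-corrected interpolation step."""
    alpha = interpolation_alpha(step, dt_sec, reference_fps)
    return current + alpha * (target - current)
```

With this correction, a slow renderer takes fewer but larger steps, so the pose reaches its target in the same wall-clock time; the animation looks choppy rather than running in slow motion.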

EDIT: This and the talking animation changes are now posted in the following PRs: (frontend), (backend). EDIT: Both merged as of 9 January, 2024.

Remaining TODOs for near future:

After these, I'll likely declare talkinghead feature-complete for now.

Technologicat commented 6 months ago

As of and, per-character configurability has been implemented (no GUI, just JSON files at least for now), and how to use it is explained in the talkinghead README.

EDIT: Both PRs have been merged as of 15 January, 2024.

biship commented 6 months ago

Fantastic effort. Thank you. I can't wait to see it.

Technologicat commented 6 months ago

Merged as of this Monday :)

After you pull the latest SillyTavern-extras from git, instructions for configuring talkinghead (and some examples) can be found in SillyTavern-extras/talkinghead/. I haven't added screenshots yet, though.

Note that to enable some features, you'll need to update your SillyTavern frontend, too. These features include the postprocessor, the talking animation (while the LLM is streaming text), and /emote support. The frontend changes have been merged into the staging branch of SillyTavern.

To check that the software works, you can use the example character. Just copy SillyTavern-extras/talkinghead/tha3/images/example.png to SillyTavern/public/characters/yourcharacternamehere/talkinghead.png. (And then follow the instructions in the README to enable and configure talkinghead.)

Also, obviously, to go beyond testing with the example character, you'll need to make a talkinghead.png for your AI character. If you use Stable Diffusion, there are some tips in the README. Expect to render 100+ gens to get one suitable result, and then to spend some time in an image editor (GIMP, Photoshop, or similar) cutting the character cleanly from the background. For reference, I finished my test character in maybe an hour and a half total, of which 20 minutes was spent in GIMP.

On the backend side, the next upcoming dev branch is talkinghead-next5@Technologicat. No changes since Monday's merge yet, aside from some TODO updates.

The up-to-date TODO list can be found here.

As usual, no promises which items I'll ever get around to doing. The most likely scenario is, I'll fix some bugs, polish up the documentation to a state worthy of an actual version-numbered release, and then switch to other projects for a while. I originally intended to hack on talkinghead for a week or so, but it's been a month already. :)

biship commented 6 months ago

@Technologicat I have the latest staging ST & latest STE, but do not have this option:

To enable talkinghead mode in Character Expressions, check the checkbox Extensions ⊳ Character Expressions ⊳ Image Type - talkinghead (extras).

I have this:

Character Expressions
Local server classification
Show default images (emojis) if sprite missing
Custom Expressions
Can be set manually or with an /emote slash command.
[ No custom expressions ]
Sprite Folder Override
Use a forward slash to specify a subfolder. Example: Bob/formal
Override folder name
Sprite set: Eve

Cohee1207 commented 6 months ago

Talkinghead can't be used with local classification. If you have it enabled, talkinghead is hidden and disabled.

biship commented 6 months ago

Ok, I had to uncheck Local server classification. I also had to restart both ST & STE. Now, how do I move the sprite?

Technologicat commented 6 months ago

@biship: Yeah, it sometimes gets finicky and likes to be restarted. But not very often, so I haven't found out the cause.

To check that different expressions work, you can use /emote xxx, where xxx is the name of one of the 28 emotions. See e.g. the filenames of the emotion templates in SillyTavern-extras/talkinghead/emotions.

I think the Character Expressions control panel also has a full list of emotions. In fact, instead of using the /emote xxx command, clicking one of the sprite slots in that control panel should apply that expression to the character. But I'm not sure at the moment if this second way works if the character doesn't have static expression sprites.

To make the character change expression automatically based on the AI character's current reply, enable classification. Not the local one, but the one served by SillyTavern-extras. Be sure to enable the classify module in your SillyTavern-extras.

For example, my SillyTavern-extras config is:

--enable-modules=classify,talkinghead,summarize,websearch --classification-model=joeddav/distilbert-base-uncased-go-emotions-student  --summarization-model=philschmid/bart-large-cnn-samsum --talkinghead-gpu

Pruning the unrelated stuff, this should work:

--enable-modules=classify,talkinghead --classification-model=joeddav/distilbert-base-uncased-go-emotions-student --talkinghead-gpu

As for positioning the sprite on the screen, the position is currently static. Due to the base pose used by the posing engine THA3, the character's legs are always cut off at the bottom of the image, so the sprite needs to be placed at the bottom.

By the way, thanks for testing - this interaction is great for finding out what I've left out of the docs. :)

Technologicat commented 6 months ago

> Talkinghead can't be used with local classification. If you have it enabled, talkinghead is hidden and disabled.

@Cohee1207: Come to think of it, is there a reason for that, other than that previous versions of talkinghead didn't have a set_emotion API endpoint? I might be able to fix this.

Technologicat commented 6 months ago

@Cohee1207, @biship: talkinghead README updated based on the latest points raised here. Thanks to both of you!

biship commented 6 months ago

> @Cohee1207, @biship: talkinghead README updated based on the latest points raised here. Thanks to both of you!

Ah, yes the new instructions are more helpful, thanks.

You could mention that if you enable "Moving UI", then you can move the image. Also, unless you have "Visual Novel Mode" enabled, you can't see the image, as it's behind the chat window. Also, you can't move it all the way to the left: there is a good 300(ish?) pixels of space on the side of the image that prevents you from moving the image to actually touch the left of the screen.

Technologicat commented 6 months ago

@Cohee1207: Thanks. I wasn't aware of what "Moving UI" did (or how to use it).

At least in my installation, talkinghead appears at the left side, at the bottom, beside the actual chat window? (Tested both at 4k and at 1080p, same behavior.)

Yes, the blank background is part of the live feed itself, which is silly, but the engine is what it is. I suppose 512x512 is just a de facto standard size for AI image processing input these days.

I could add a crop filter...

Technologicat commented 6 months ago

@Cohee1207: One more thing: technically it's in the TODO, but I know you're busy, so I'll mention it here: I aim to update the user manual for talkinghead.

I think we should de-emphasize AITubing/VTubing, given the different aim of the software (animating AI character avatars, not user avatars). The new README accounts for this already.

I still have to combine any relevant information from the old user manual, and add some screenshots.

I'll finish the README first. Once done, we can then see if it should be moved to replace the old user manual.

Technologicat commented 6 months ago

And just to avoid any surprises, mentioning this too: most of the postprocessing filters now have automatic FPS correction. The postprocessor also reports its average elapsed time per frame (extras log, info level). Note that as of this writing, the postproc time is also included in the reported render time. This might still change before the next release.

The VHS glitches filter is still missing FPS correction. I'll try to fix that tomorrow (or in the next few days).

EDIT: Ah, and fixed a minor race condition in the postprocessor when replacing the filter chain on the fly.

Technologicat commented 5 months ago

@Cohee1207: I quickly added a simple crop filter to the backend, now available in talkinghead-next5, and documented in the README.

However, there seems to be some logic at the frontend side that reserves a square shape for the talkinghead sprite output, regardless of the image dimensions or aspect ratio of the actual result_feed.

This already makes the postprocessor's job lighter, since it doesn't have to handle that much empty space.

I'll need to take a closer look at the frontend...

In other news, all postprocessor filters are now framerate-independent, you can have multiple filters of the same kind in the postprocessor chain, and the TODO has been updated to more clearly indicate priorities.

Technologicat commented 5 months ago

Added server-side animator and postprocessor settings to talkinghead-next5.

Loaded from SillyTavern-extras/talkinghead/animator.json. Can provide the same settings as in SillyTavern/public/characters/yourcharacternamehere/_animator.json, except that these act as customizable server-side defaults.

Three-level system:

~Still need to update the docs.~ EDIT ...aaaaand updated. Now described in the talkinghead README.
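To make the layering idea concrete, here is a minimal sketch of one way such settings could be merged. It rests on assumptions: that the layers are built-in defaults, then the server-side animator.json, then the per-character _animator.json, with later layers overriding earlier ones; the keys shown are made up for illustration; and `merge_layers` and `load_json_if_present` are hypothetical helpers. The talkinghead README remains the authority on the real behavior and setting names.

```python
import json

# Hypothetical built-in defaults; the real setting names are in the README.
BUILTIN_DEFAULTS = {"target_fps": 25, "postprocessor_chain": []}


def load_json_if_present(path):
    """Return the parsed JSON object at `path`, or {} if the file does not exist."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def merge_layers(builtin, server, character):
    """Combine three settings layers; later layers override earlier ones."""
    settings = dict(builtin)
    settings.update(server)      # customizable server-side defaults
    settings.update(character)   # per-character overrides win
    return settings


def effective_settings(server_path, character_path):
    """Load both optional JSON files and merge them over the built-in defaults."""
    return merge_layers(BUILTIN_DEFAULTS,
                        load_json_if_present(server_path),
                        load_json_if_present(character_path))
```

A shallow `dict.update` is enough for flat string/bool/int/float options; nested structures would need a deliberate decision on whether a layer replaces or extends them.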

Technologicat commented 5 months ago

Today's update (up to commit 6193074):

EDIT: Backend nearing completion for now. Posted the PR, EDIT: Merged as of 21 January 2024.

Technologicat commented 5 months ago

~Development now in talkinghead-next6@Technologicat.~

Next priority areas are some frontend fixes (not affecting extras, just the main ST), and polishing the documentation for release (which does affect extras).

~See updated talkinghead TODO.~

EDIT: PR created for these postproc filters: [a.k.a. talkinghead-next6] EDIT: Merged as of 25 January 2024. EDIT: No new changes yet as of 1 February 2024, so the up to date TODO list for Talkinghead is now in the main repo, see talkinghead/ Edited the first post at the top of this thread to clarify this.

Technologicat commented 5 months ago

Note to self: playAudioData in SillyTavern/public/scripts/extensions/tts/index.js already does lip sync with VRM (VRoid), so that's where we should inject it if we want to do that with talkinghead.

Investigate this properly (possibly much) later. THA3 has the morphs already; the remaining issue is to extract the phoneme from the TTS audio - currently I have no idea how VRM does it.

But to even develop this functionality, I'd first need to get a TTS setup working.

EDIT: Clarification: TTS is currently not a priority for me, and likely won't be in the near future, for various reasons.

So, while I find TTS lip-syncing an interesting technical problem and might look into it later, for practical use I don't need it right now.

Technologicat commented 5 months ago

Some small client-side Talkinghead issues fixed in:

Merged as of 6 February 2024.

EDIT: Specifically, the fixes are: