SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.
GNU Affero General Public License v3.0

Talkinghead performance improvements and refactoring #207

Closed. Technologicat closed this 11 months ago.

Technologicat commented 11 months ago

Here's the next PR for talkinghead.

NOTE: In this version, server.py always loads the live mode at float16 precision, as this saves VRAM with no visible difference in output, and should give a performance boost according to the original author (now that the live mode actually uses enough GPU compute for its inference performance to matter).

~Thus, this version requires the separable_half model files to run.~ EDIT: Autodownloader added. If the talkinghead/tha3/models/ directory is missing, the models will be automatically pulled in from HuggingFace. The default repo to install from is OktayAlpk/talking-head-anime-3, but for future-proofing, there's an option --talkinghead-models=somehfuser/somehfrepo to download from another repo. Note the plural in the option name, models.

~I intend to later add an option to choose which talkinghead model to use.~ EDIT: Added, e.g. --talkinghead-model=separable_half. The default is 'auto', which picks separable_half on GPU and separable_float on CPU.

New options TL;DR: generally, just use --talkinghead-gpu as before, and don't worry. But before the first run, delete (or rename) your talkinghead/tha3/models/ directory to trigger a one-time download of all four available THA3 models.
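
To make this concrete, here's a rough sketch of how the new options fit together. The option names and the 'auto' behavior are as described above; the argparse wiring and variable names are just illustrative, not the actual server.py code:

```python
import argparse

parser = argparse.ArgumentParser()
# Option names from this PR; help texts and defaults shown for illustration.
parser.add_argument("--talkinghead-gpu", action="store_true",
                    help="Run the talkinghead live mode on the GPU.")
parser.add_argument("--talkinghead-model", default="auto",
                    help="THA3 variant to load, e.g. 'separable_half'. 'auto' picks "
                         "separable_half on GPU and separable_float on CPU.")
parser.add_argument("--talkinghead-models", default="OktayAlpk/talking-head-anime-3",
                    help="HuggingFace repo to pull the THA3 models from if "
                         "talkinghead/tha3/models/ is missing.")
args = parser.parse_args()

# The 'auto' resolution described above.
model = args.talkinghead_model
if model == "auto":
    model = "separable_half" if args.talkinghead_gpu else "separable_float"
```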

@Cohee1207: Opinion, please?

Cohee1207 commented 11 months ago

NOTE: In this version, server.py always loads the live mode at float16 precision, as this saves VRAM with no visible difference in output, and should give a performance boost according to the original author (now that the live mode actually uses enough GPU compute for its inference performance to matter).

Is it still possible to use talkinghead on the CPU? Or is GPU now required to run?

Thus, this version requires the separable_half model files to run.

How to get these? Are they automatically downloaded or not?

Technologicat commented 11 months ago

Is it still possible to use talkinghead on the CPU? Or is GPU now required to run?

It is still possible to run on CPU.

However, because THA3 is a deep-learning model, the performance is about what one would expect. I got ~2 FPS on an i7-12700H. The live mode really wants a GPU. For the batch export in the manual poser, CPU mode is fine.

One more thing: I just remembered that CPUs don't support float16.

I think we need a command-line option added to server.py to choose the model. I've actually already done that in my talkinghead-next branch, where I'm working on the next PR, but it also includes other upcoming changes.

Anyway, in that version, the default is auto, which picks separable_half if --talkinghead-gpu is set, and separable_float otherwise. Maybe we should cherrypick that change here.

How to get these? Are they automatically downloaded or not?

I got them off the original author @pkhungurn's Dropbox link. Definitely not a long-term solution.

Some friendly user has posted a copy of the models on HuggingFace. According to @pkhungurn in his README, the models are licensed under Creative Commons Attribution 4.0 International, so redistribution is fine, but of course we don't control that particular HuggingFace repo.

Before I started on this, we already had a local copy of the separable_float model (60 MB total) in the SillyTavern-extras repo, under talkinghead/tha3/models/.

The separable_half model is essentially the same model at float16 precision. It takes about 30 MB total.

Ideally you need both - separable_float for CPU, and separable_half for GPU.

So mirror locally, set up an auto-download, or something else? What would you prefer?


Side note:

The separable_half version is actually much faster on GPU now that I've fixed the bottlenecks.

This PR renders at ~30 FPS, but that measurement may not be very accurate, and this version generates new frames regardless of whether anything has consumed the previous one.

On talkinghead-next, with smarter logic and more accurate measurements, I'm getting ~45 FPS render speed on separable_half, and ~30 FPS on separable_float. Since that version caps the network send at 25 FPS, a faster renderer finishes each frame sooner and then idles, which means less GPU load.

Rate limiting is important to not DoS the SillyTavern GUI. It also seems important to constantly send something, or the GUI will hang. I found it works well if I send the frames at a smooth 24 FPS, decoupled from the actual renderer.

There are still some timing issues to work out. Rendering at separable_float outputs only 20 FPS over the network, whereas separable_half hits 24 FPS. In each case, the renderer is fast enough to have the new frame complete before the network code asks for it. Perhaps the network loop needs a smarter limiter. In my initial tests, a constant wait after each sent frame was the most reliable.
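
To illustrate the decoupling I mean, here's a minimal sketch (not the actual code; the names and threading setup are made up for illustration): the renderer overwrites a shared "latest frame" slot as fast as it can, and the network loop sends whatever is there at a fixed cadence, with a constant wait after each sent frame.

```python
import threading
import time

latest_frame = None            # most recent encoded frame (e.g. a PNG blob)
frame_lock = threading.Lock()

def render_loop(render_frame):
    """Render as fast as the model allows; just overwrite the shared slot."""
    global latest_frame
    while True:
        frame = render_frame()          # THA3 inference + encode
        with frame_lock:
            latest_frame = frame

def network_loop(send, target_fps=24):
    """Send the newest available frame at a smooth, fixed rate."""
    wait = 1.0 / target_fps             # constant wait after each sent frame
    while True:
        with frame_lock:
            frame = latest_frame
        if frame is not None:
            send(frame)                 # always send *something* so the GUI doesn't hang
        time.sleep(wait)

# Hypothetical wiring; my_render_fn and my_send_fn stand in for the real callables.
# threading.Thread(target=render_loop, args=(my_render_fn,), daemon=True).start()
# threading.Thread(target=network_loop, args=(my_send_fn,), daemon=True).start()
```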

Cohee1207 commented 11 months ago

Anyway, in that version, the default is auto, which picks separable_half if --talkinghead-gpu is set, and separable_float otherwise. Maybe we should cherrypick that change here.

Yes, please. Otherwise, it will break for CPU-only users (even if it's not a valuable option, someone could be using it right now). Also, I wasn't able to run it on my Mac, which has no CUDA by definition, so CPU is still preferable there.

So mirror locally, set up an auto-download, or something else? What would you prefer?

If we already have a model in the repo and have had no complaints about it so far, adding another one would be fine. Alternatively, we can remove the one we have and replace it with a download from the HF Hub. I'd actually prefer the second option, to keep repo clones lean.

Technologicat commented 11 months ago

Cherrypicked.

(Technically, git checkout talkinghead-next server.py and then commit, since that was easiest.)

If we already have a model in the repo and have had no complaints about it so far, adding another one would be fine. Alternatively, we can remove the one we have and replace it with a download from the HF Hub. I'd actually prefer the second option, to keep repo clones lean.

Yes, I agree the auto-download sounds better for the GitHub repo.

Do we rely on that random HuggingFace repo, or should we create our own mirror?

And is there a preferred way to implement an auto-download from HuggingFace? These files must go into talkinghead/tha3/models/, not to ~/.cache/huggingface/hub/.

Alternatively, I could modify the loaders to look for them in the default auto-downloader location instead, but this would cause a re-download in existing installs (only 60 MB, though).

Cohee1207 commented 11 months ago

Do we rely on that random HuggingFace repo

Summarization and classification already work like this.

And is there a preferred way to implement an auto-download from HuggingFace? These files must go into talkinghead/tha3/models/, not to ~/.cache/huggingface/hub/.

If the huggingface_hub package can't handle downloads to a custom folder, it could be a regular requests download.

Technologicat commented 11 months ago

Ok. Thanks for the pointers!

I'll see about implementing the auto-download later tonight.

Technologicat commented 11 months ago

@Cohee1207 Autodownloader added. huggingface_hub worked fine for this.

By default, the autodownload pulls from OktayAlpk/talking-head-anime-3, but for future-proofing, there's a new --talkinghead-models=somehfuser/somehfrepo option (note the plural, models) that installs from a user-specified HF repo.

One thing I was wondering: should we use the symlink mode of snapshot_download? All civilized OSs have them, but some users are probably on MS Windows.

In huggingface_hub, installing with symlinks is great if multiple programs need the same models, but in the case of THA3, the model is rare enough that perhaps it's not an issue, and better compatibility with all OSs is preferable.
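
For the record, the call looks roughly like this (a sketch, not the exact code in the PR; the keyword arguments depend on the huggingface_hub version, and newer versions copy plain files into local_dir by default):

```python
from huggingface_hub import snapshot_download

# Pull the THA3 models straight into the directory the loaders expect,
# instead of the default ~/.cache/huggingface/hub/ location.
snapshot_download(repo_id="OktayAlpk/talking-head-anime-3",
                  local_dir="talkinghead/tha3/models",
                  local_dir_use_symlinks=False)  # plain files, no symlinks
```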

Cohee1207 commented 11 months ago

Cool, glad it worked so smoothly. I don't think symlinks are going to be that much of a big deal. Leaving them as plain files is fine.

Technologicat commented 11 months ago

Yeah, I was positively surprised.

Ok, will leave them as plain files. Actually that's what it already does. :)

Anything else to change within the scope of this PR?

(Note this PR's scope is refactoring and optimization. Feature improvements are already going into talkinghead-next, which I'll PR later once it's in acceptable shape.)

Cohee1207 commented 11 months ago

Btw, regarding the streaming performance, I think we can look into replacing the PNG stream with webm video with a transparent background (alpha channel), encoded with ffmpeg using the yuva420p pixel format. That could boost performance (or not), but it's worth trying.
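
Something along these lines, roughly (an untested sketch; the frame size and FPS are placeholders, not the actual talkinghead values):

```python
import subprocess

WIDTH, HEIGHT, FPS = 512, 512, 24   # placeholders

# Pipe raw RGBA frames into ffmpeg; VP9 in webm keeps the alpha channel
# when the output pixel format is yuva420p.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-f", "rawvideo", "-pix_fmt", "rgba",
     "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS), "-i", "-",
     "-c:v", "libvpx-vp9", "-pix_fmt", "yuva420p", "-f", "webm", "pipe:1"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Each frame would be WIDTH * HEIGHT * 4 bytes of RGBA data:
# ffmpeg.stdin.write(frame_bytes)
```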

Technologicat commented 11 months ago

Probably a good idea. OTOH, the PNG streaming can do 24 FPS (probably more, but that would DoS the client), which is enough for anime, so I'm not sure how large the practical performance gain from switching to a proper video encoder would be. Having the possibility to stream a 60 FPS talkinghead on high-end GPUs would be nice, though.

Also, I'm not that familiar with encoding video from inside a Python app. Perhaps a separate PR for this later?

Cohee1207 commented 11 months ago

MacBook Pro performance in CPU mode doesn't look too good, but otherwise it's ready to merge.

Technologicat commented 11 months ago

Thanks!

Yeah, CPU mode is horribly slow. Deep learning doing what deep learning does best, consuming compute resources (Thompson et al., 2020)...

Since the filesize of the separable_half model is 30 MB, at two bytes per weight this suggests the neural net size is approximately 15M parameters. At float32 precision, with a program designed primarily for GPUs, I'd say that's gonna hit a CPU hard. :)

(Section 3.4 of @pkhungurn's original tech report discusses the architecture. From skimming it, I didn't find an authoritative total number of parameters, but I did learn that the separable model variants use depthwise separable convolutions, whence the name.)

The unpublished THA4 version, discussed in the arXiv paper by the same author, adds distillation to create a smaller student network.

While this would also be an interesting possibility for increasing performance regardless of device, it has the drawbacks that 1) the full THA4 is even slower than THA3, 2) the student has to be trained separately for each character, and 3) since the models and code are not published, it would be too much work to replicate independently (from the tech reports, the original author has been developing the THA model family at least since 2019).

Sure, per-character training is not a problem for AItubers, or for use in a visual novel game. But I think talkinghead occupies a different, underappreciated niche in the problem space.

At least for me, the draw of THA3 is that it's a general AI model that aims to pose any anime-style character it hasn't seen before, given just one suitable input image. Furthermore, on a GPU, we can do that in realtime, so we can treat the AI-modeled character pretty much like a traditional 3D-modeled character, synthesizing animation trajectories on the fly. Now that's magic.

(Who would have thought that we'd task a GPU with... generating graphics?)