NOTE: In this version, server.py always loads the live mode at float16 precision, as this saves VRAM with no visible difference in output, and should give a performance boost according to the original author (now that the live mode actually uses enough GPU compute for its inference performance to matter).
Is it still possible to use talkinghead on the CPU? Or is GPU now required to run?
Thus, this version requires the separable_half model files to run.
How to get these? Are they automatically downloaded or not?
Is it still possible to use talkinghead on the CPU? Or is GPU now required to run?
It is still possible to run on CPU.
However, because THA3 is a deep-learning model, the performance is about what one would expect. I got ~2 FPS on an i7-12700H. The live mode really wants a GPU. For the batch export in the manual poser, CPU mode is fine.
One more thing. I remembered now that CPUs don't support float16.
I think we need a command-line option added to server.py to choose the model. I've actually already done that in my talkinghead-next branch, where I'm working on the next PR, but that branch also includes other upcoming changes.

Anyway, in that version, the default is auto, which picks separable_half if --talkinghead-gpu is set, and separable_float otherwise. Maybe we should cherrypick that change here.
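For illustration, here is a minimal sketch of how such an option could be wired up with argparse. This is hypothetical code, not the actual talkinghead-next implementation; only the flag names --talkinghead-gpu and --talkinghead-model and the auto behaviour come from this discussion, and the list of model variants is abbreviated.

```python
import argparse

parser = argparse.ArgumentParser()
# Flags as discussed in this thread; the real server.py defines many more options.
parser.add_argument("--talkinghead-gpu", action="store_true",
                    help="Run the talkinghead live mode on the GPU.")
parser.add_argument("--talkinghead-model", default="auto",
                    choices=["auto", "separable_float", "separable_half"],  # THA3 also has non-separable variants
                    help="Which THA3 model variant to load.")
args = parser.parse_args()

model = args.talkinghead_model
if model == "auto":
    # float16 is only useful on the GPU; CPU mode falls back to the float32 variant.
    model = "separable_half" if args.talkinghead_gpu else "separable_float"
```

The point of the auto default is that existing users can keep launching with just --talkinghead-gpu and get the faster float16 model, while CPU-only installs keep working unchanged.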
How to get these? Are they automatically downloaded or not?
I got them off the original author @pkhungurn's Dropbox link. Definitely not a long-term solution.
Some friendly user has posted a copy of the models on HuggingFace. According to @pkhungurn in his README, the models are licensed under Creative Commons Attribution 4.0 International, so redistribution is fine, but of course we don't control that particular HuggingFace repo.
Before I started on this, we already had a local copy of the separable_float model (60 MB total) in the SillyTavern-extras repo, under talkinghead/tha3/models/.

The separable_half model is essentially the same model at float16 precision. It takes about 30 MB total.

Ideally you need both: separable_float for CPU, and separable_half for GPU.
So mirror locally, set up an auto-download, or something else? What would you prefer?
Side note: the separable_half version is actually much faster on GPU now that I've fixed the bottlenecks.
This PR renders at ~30 FPS, but the measurement might actually not be that accurate, and this version doesn't care whether anything consumes the previous frame before generating a new one.
On talkinghead-next, with smarter logic and more accurate measurements, I'm getting ~45 FPS render speed on separable_half, and ~30 FPS on separable_float. Since in that version I'm capping the network send to 25 FPS maximum, having the renderer go faster means less GPU load.
Rate limiting is important to not DoS the SillyTavern GUI. It also seems important to constantly send something, or the GUI will hang. I found it works well if I send the frames at a smooth 24 FPS, decoupled from the actual renderer.
There are still some timing issues to work out. Rendering at separable_float outputs only 20 FPS over the network, whereas separable_half hits 24 FPS. In each case, the renderer is fast enough to have the new frame complete before the network code asks for it. Perhaps the network loop needs a smarter limiter. In my initial tests, a constant wait after each sent frame was the most reliable.
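To make the constant-wait idea concrete, here is a rough sketch of a sender loop decoupled from the renderer. The function and callback names are hypothetical; the actual app.py is structured differently.

```python
import time

TARGET_FPS = 24
FRAME_INTERVAL = 1.0 / TARGET_FPS

def network_send_loop(get_latest_frame, send_frame):
    """Send the most recently rendered frame at a steady pace,
    independently of how fast the renderer produces new frames."""
    while True:
        frame = get_latest_frame()   # may repeat a frame if the renderer hasn't finished a new one
        if frame is not None:
            send_frame(frame)        # e.g. write one PNG into the multipart stream
        time.sleep(FRAME_INTERVAL)   # constant wait after each sent frame
```

Note that a fixed sleep ignores the time spent encoding and sending each frame, which may be part of why the measured network rate lands below the render rate.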
Anyway, in that version, the default is auto, which picks separable_half if --talkinghead-gpu is set, and separable_float otherwise. Maybe we should cherrypick that change here.
Yes, please. Otherwise it will break for CPU-only users (even if it's not a valuable option, someone could be using it right now). Also, I wasn't able to run it on my Mac, which has no CUDA by definition, so CPU is still preferable there.
So mirror locally, set up an auto-download, or something else? What would you prefer?
If we already have a model in the repo and have no complaints about it so far, putting another one will be fine. Alternatively, we can remove the one we have and replace the download using the HF hub. I'd actually like the second option to keep the repo clones lean.
Cherrypicked.

(Technically, git checkout talkinghead-next server.py and then commit, since that was easiest.)
If we already have a model in the repo and have no complaints about it so far, putting another one will be fine. Alternatively, we can remove the one we have and replace the download using the HF hub. I'd actually like the second option to keep the repo clones lean.
Yes, I agree the auto-download sounds better for the GitHub repo.
Do we rely on that random HuggingFace repo, or should we create our own mirror?
And is there a preferred way to implement an auto-download from HuggingFace? These files must go into talkinghead/tha3/models/, not to ~/.cache/huggingface/hub/.

Alternatively, I could modify the loaders to look for them in the default auto-downloader location instead, but this would cause a re-download in existing installs (only 60 MB, though).
Do we rely on that random HuggingFace repo
Summarization and classification already work like this.
And is there a preferred way to implement an auto-download from HuggingFace? These files must go into talkinghead/tha3/models/, not to ~/.cache/huggingface/hub/.
If the huggingface_hub package can't handle downloads to a custom folder, it could be a regular requests download.
Ok. Thanks for the pointers!
I'll see about implementing the auto-download later tonight.
@Cohee1207 Autodownloader added. huggingface_hub worked fine for this. The models are now pulled into talkinghead/tha3/models/ if that local directory is missing. By default, the autodownload pulls from OktayAlpk/talking-head-anime-3, but for future-proofing, there's a new --talkinghead-models=somehfuser/somehfrepo option (note the plural, models) that installs from a user-specified HF repo.
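For reference, the kind of call involved looks roughly like the sketch below. This assumes a huggingface_hub version that supports the local_dir and local_dir_use_symlinks arguments of snapshot_download; the exact code in the PR may differ.

```python
import os
from huggingface_hub import snapshot_download

MODELS_DIR = os.path.join("talkinghead", "tha3", "models")
DEFAULT_REPO = "OktayAlpk/talking-head-anime-3"  # overridable via --talkinghead-models

if not os.path.isdir(MODELS_DIR):
    # Download straight into the project tree instead of ~/.cache/huggingface/hub/.
    snapshot_download(repo_id=DEFAULT_REPO,
                      local_dir=MODELS_DIR,
                      local_dir_use_symlinks=False)  # plain files; friendlier to MS Windows
```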
One thing I was thinking of: should we use the symlink mode of snapshot_download? All civilized OSs have them, but some users are probably on MS Windows.

In huggingface_hub, installing with symlinks is great if multiple programs need the same models, but in the case of THA3, the model is rare enough that it's perhaps not an issue, and better compatibility with all OSs is preferable.
Cool, glad it worked so smoothly. I don't think symlinks are going to be that big of a deal. Leaving them as plain files is fine.
Yeah, I was positively surprised.
Ok, will leave them as plain files. Actually that's what it already does. :)
Anything else to change within the scope of this PR?
(Note the scope of this PR: refactoring and optimization. Feature improvements are already going into talkinghead-next, which I'll PR later once it's in acceptable shape.)
Btw, regarding the streaming performance, I think we could look into replacing the PNG stream with webm video with a transparent background (alpha channel), encoded with ffmpeg using the yuva420p pixel format. That could boost performance (or not), but it's worth trying.
Probably a good idea. OTOH, the PNG streaming can do 24 FPS (probably more but it'll DoS the client), which is enough for anime, so I'm not sure how large the practical performance gain from switching to a proper video encoder would be. Having the possibility to stream a 60 FPS talkinghead on high-end GPUs would be nice, though.
Also, I'm not that familiar with encoding video from inside a Python app. Perhaps a separate PR for this later?
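If someone wants to experiment later, a rough, untested sketch of the encoding side could pipe PNG frames into an ffmpeg subprocess and encode VP9 webm with an alpha plane (yuva420p). This assumes an ffmpeg build with libvpx-vp9 on the PATH, and it writes a file rather than streaming, just to test the idea; the function and output path here are hypothetical.

```python
import subprocess

def encode_webm_alpha(png_frames, out_path="talkinghead_test.webm", fps=24):
    """Pipe an iterable of PNG-encoded frames (bytes) into ffmpeg,
    producing a VP9 webm with an alpha channel."""
    cmd = ["ffmpeg", "-y",
           "-f", "image2pipe", "-framerate", str(fps), "-i", "-",  # read PNGs from stdin
           "-c:v", "libvpx-vp9", "-pix_fmt", "yuva420p",           # VP9 with an alpha plane
           out_path]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for png_bytes in png_frames:
        proc.stdin.write(png_bytes)
    proc.stdin.close()
    proc.wait()
```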
MacBook Pro performance in CPU mode doesn't look too good, but otherwise it's ready to merge.
Thanks!
Yeah, CPU mode is horribly slow. Deep learning doing what deep learning does best, consuming compute resources (Thompson et al., 2020)...
Since the file size of the separable_half model is 30 MB, at two bytes per weight this suggests the neural net is approximately 15M parameters. At float32 precision, with a program designed primarily for GPUs, I'd say that's gonna hit a CPU hard. :)

(Section 3.4 of @pkhungurn's original tech report discusses the architecture. From skimming it, I didn't find an authoritative total number of parameters, but I did learn that the separable model variants use depthwise separable convolutions, whence the name.)
The unpublished THA4 version, discussed in the arXiv paper by the same author, adds distillation to create a smaller student network.
While this would also be an interesting possibility for increasing performance regardless of device, it has the drawbacks that 1) the full THA4 is even slower than THA3, 2) the student has to be trained separately for each character, and 3) since the models and code are not published, it would be too much work to replicate independently (from the tech reports, the original author has been developing the THA model family at least since 2019).
Per-character training is not a problem for AItubers, or for use in a visual novel game, sure. But I think talkinghead occupies a different, underappreciated niche in the problem space.
At least for me, the draw of THA3 is in that it's a general AI model that aims to pose any anime-style character it hasn't seen before, given just one suitable input image. Furthermore, on a GPU, we can do that in realtime, so we can treat the AI-modeled character pretty much like a traditional 3D-modeled character, synthesizing animation trajectories on the fly. Now that's magic.
(Who would have thought that we'd task a GPU with... generating graphics?)
Here's the next PR for talkinghead.

- [...] talkinghead module is enabled, and also exits faster.
- talkinghead/tha3/app/app.py now only provides the live mode. [...] talkinghead/tha3/app/util.py.
- NOTE: In this version, server.py always loads the live mode at float16 precision, as this saves VRAM with no visible difference in output, and should give a performance boost according to the original author (now that the live mode actually uses enough GPU compute for its inference performance to matter). ~Thus, this version requires the separable_half model files to run.~ EDIT: Autodownloader added. If the talkinghead/tha3/models/ directory is missing, the models will be automatically pulled in from HuggingFace. The default repo to install from is OktayAlpk/talking-head-anime-3, but for future-proofing, there's an option --talkinghead-models=somehfuser/somehfrepo to download from another repo. Note the plural in the option name, models.
- ~I intend to later add an option to choose which talkinghead model to use.~ EDIT: Added, e.g. --talkinghead-model=separable_half. The default is 'auto', which picks separable_half on GPU and separable_float on CPU.

New options TL;DR: generally, just use --talkinghead-gpu as before, and don't worry. But before the first run, delete (or rename) your talkinghead/tha3/models/ directory to trigger a one-time download of all four available THA3 models.

@Cohee1207: Opinion, please?