OpenTalker / SadTalker

[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
https://sadtalker.github.io/

Disentangling lip motion from head motion: Experimenting #562

Open sbersier opened 1 year ago

sbersier commented 1 year ago

Hi, SadTalker is really nice. Good job! Thanks!

I'm currently experimenting with it and noticed that expression_scale doesn't disentangle head motion from lip motion.

So, I modified the code (see below) so that I can control "head motion amplitude" and "mouth motion amplitude" independently.

For example, it allows me to generate a video with:

python inference.py --driven_audio audio.wav --source_image image.png --size 256 --enhancer gfpgan --pose_style 2 --expression_scale 1.5 --head_motion_scale 0.5

This amplifies mouth motion by a factor of 1.5 while scaling down head motion by a factor of 0.5.

Note: if you only specify expression_scale, then head_motion_scale defaults to 1.0 (same behavior as before).

Here is an example:

https://github.com/OpenTalker/SadTalker/assets/34165937/b83ca0e7-2249-4b57-b0ca-79688834b868

The above example was generated with the following commands:

python inference.py --driven_audio audio.wav --source_image image.png --size 256 --enhancer gfpgan --pose_style 2 --expression_scale 1 --head_motion_scale 1 (on the left)

python inference.py --driven_audio audio.wav --source_image image.png --size 256 --enhancer gfpgan --pose_style 2 --expression_scale 2 --head_motion_scale 0.5 (on the right)

If you want to explore a bit, you can also try setting expression_scale to 0.0 while setting head_motion_scale to 2.0, or setting expression_scale to 2.0 while setting head_motion_scale to 0.0.

I found it quite fun and quite informative to play with. For example, setting head_motion_scale to 0.0 and expression_scale to 1.0, we see that when the character blinks, the head still moves a bit. It looks like head motion and eye blinks are not as disentangled as I would have expected.

What do you think?

Best regards, SB

The modifications to the code:

NOTE: Before modifying the code, I would recommend making copies of SadTalker/inference.py and SadTalker/src/generate_facerender_batch.py and keeping them in a safe place, so that you can always revert.

A) In SadTalker/inference.py

On line 85:

Replace:

expression_scale=args.expression_scale, still_mode=args.still, preprocess=args.preprocess, size=args.size)

With the following:

expression_scale=args.expression_scale, head_motion_scale=args.head_motion_scale, still_mode=args.still, preprocess=args.preprocess, size=args.size)

On line 109:

Replace:

parser.add_argument("--expression_scale", type=float, default=1., help="the batch size of facerender")

With the two following lines:

parser.add_argument("--expression_scale", type=float, default=1.,  help="expression scale")
parser.add_argument("--head_motion_scale", type=float, default=1.,  help="head motion scale")

B) In SadTalker/src/generate_facerender_batch.py

On line 10:

Replace:

expression_scale=1.0, still_mode = False, preprocess='crop', size = 256):

with the following:

expression_scale=1.0, head_motion_scale=1.0, still_mode = False, preprocess='crop', size = 256):

On line 44:

Replace:

generated_3dmm[:, :64] = generated_3dmm[:, :64] * expression_scale

with the following two lines:

generated_3dmm[:, :64] = generated_3dmm[:, :64] * expression_scale
generated_3dmm[:, 64:] = generated_3dmm[:, 64:] * head_motion_scale  
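
To make the idea of the change clearer, here is a minimal standalone sketch of what the scaling does. This is only an illustration, assuming the usual SadTalker layout where the first 64 columns of the generated 3DMM coefficients are the expression coefficients and the remaining columns encode the head pose; the function and variable names below are mine, not part of SadTalker:

import numpy as np

def scale_motion(coeffs, expression_scale=1.0, head_motion_scale=1.0):
    # coeffs: array of shape (num_frames, num_coeffs); columns 0..63 are
    # assumed to be expression coefficients and columns 64.. the head pose
    # (rotation/translation), as in SadTalker's generated 3DMM batch.
    scaled = coeffs.copy()
    scaled[:, :64] *= expression_scale   # mouth/eye expressions
    scaled[:, 64:] *= head_motion_scale  # head motion
    return scaled

# Example: amplify expressions by 1.5, damp head motion by 0.5
coeffs = np.random.randn(100, 70).astype(np.float32)  # dummy coefficients
out = scale_motion(coeffs, expression_scale=1.5, head_motion_scale=0.5)

The two lines added in generate_facerender_batch.py do the same thing in place on generated_3dmm.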
parthagorai commented 1 year ago

Please guide me on how I can control eye blinking. Thanks

sbersier commented 1 year ago

@parthagorai The expression_scale factor affects both the mouth and the eyes, and I don't think it is possible to separate them this way.

But you can always use a reference video for the eye blinks.

For example:

python inference.py --ref_eyeblink eyeblink.mp4 --driven_audio audio.wav --source_image image.png --size 256 --enhancer gfpgan --pose_style 2 --expression_scale 1.5 --head_motion_scale 0.5

Here, eyeblink.mp4 points to a video of a real person blinking. But I'm not sure whether it reproduces the eyelid motion itself (including its "scale") or just detects that the person blinked at that particular moment. It could be worth trying.

parthagorai commented 1 year ago

Thanks for your reply @sbersier.

I'd be happy if you could provide guidance on achieving more natural head motion and expressions, similar to what other AI tools like Heygen AI offer. I am using --preprocess full.

sbersier commented 1 year ago

@parthagorai Well, you can always use a video to drive the head motion. Something like:

python inference.py --ref_pose reference_pose.mp4 --ref_eyeblink eyeblink.mp4 --driven_audio audio.wav --source_image image.png --size 256 --enhancer gfpgan --pose_style 2 --expression_scale 1.5 --head_motion_scale 0.5

Here, reference_pose.mp4 is a reference video (it may work better if the person in it is not talking). SadTalker will then copy the head motion from reference_pose.mp4 and the eye blinks from eyeblink.mp4.

If you combine this with expression_scale and head_motion_scale factors, I think the result will be as good as SadTalker can be...

hashnimo commented 9 months ago

Thank you so much for this tutorial and for making it easy to understand. I had been wondering how to get the teeth visibility and other features discussed here, and now it finally works.