KwaiVGI / LivePortrait

Bring portraits to life!
https://liveportrait.github.io

Audio driven lip syncing capabilities #310

Open danablend opened 2 months ago

danablend commented 2 months ago

I'm sure I'm not the only one who would love to use this for audio-driven video editing, particularly for lip syncing.

At the moment, I have successfully gotten it to work by chaining @Inferencer's LipSick library together with LivePortrait, and the results are decent.

For LivePortrait motion, we have two options:

  1. Use relative motion (frame-to-frame deltas)
  2. Use absolute motion

I have opted for relative motion, because the absolute motion introduces too much video stuttering in my experience.

However, the relative-motion lip movement is not pronounced enough for me to get great results (although results are decent). This could be because LipSick's lip movements are too subtle for LivePortrait's relative motion to perform at its best. Alternatively, we might be able to modify the LivePortrait code slightly to increase the weight of the relative motion differences, bringing it a little closer to absolute motion while avoiding the video stuttering (I've sketched one such tweak after the snippet below).

I'm currently playing with code in this area, trying to dial in the lip movement - I'm by no means an expert here, so it's mostly trial and error:

# File: liveportrait/src/live_portrait_pipeline.py

# Relative motion: layer the driving motion deltas on top of the source pose/expression
if inf_cfg.flag_relative_motion:
    if flag_is_source_video:
        if inf_cfg.flag_video_editing_head_rotation:
            R_new = x_d_r_lst_smooth[i]
        else:
            R_new = R_s
    else:
        R_new = (R_d_i @ R_d_0.permute(0, 2, 1)) @ R_s

    delta_new = x_d_exp_lst_smooth[i] if flag_is_source_video else x_s_info['exp'] + (x_d_i_info['exp'] - x_d_0_info['exp'])
    scale_new = x_s_info['scale'] if flag_is_source_video else x_s_info['scale'] * (x_d_i_info['scale'] / x_d_0_info['scale'])
    t_new = x_s_info['t'] if flag_is_source_video else x_s_info['t'] + (x_d_i_info['t'] - x_d_0_info['t'])
# Absolute motion: use the driving frame's pose/expression directly
else:
    if flag_is_source_video:
        if inf_cfg.flag_video_editing_head_rotation:
            R_new = x_d_r_lst_smooth[i]
        else:
            R_new = R_s
    else:
        R_new = R_d_i
    delta_new = x_d_exp_lst_smooth[i] if flag_is_source_video else x_d_i_info['exp']
    scale_new = x_s_info['scale']
    t_new = x_d_i_info['t']
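
To experiment with the "increase the weight of the relative motion differences" idea, the simplest tweak I can think of is a gain factor on the expression delta in the relative branch. This is only a sketch of mine, not code from the repo: LIP_DELTA_GAIN is a made-up knob, and scaling the whole expression delta will exaggerate more than just the lips.

# Hypothetical tweak (not in the repo): amplify the driving expression delta
# before it is added onto the source expression in the relative-motion branch.
LIP_DELTA_GAIN = 1.5  # values > 1.0 exaggerate the driving expression change

if inf_cfg.flag_relative_motion and not flag_is_source_video:
    exp_delta = x_d_i_info['exp'] - x_d_0_info['exp']
    delta_new = x_s_info['exp'] + LIP_DELTA_GAIN * exp_delta

A gain only slightly above 1.0 is probably the place to start, since large values will likely reintroduce the stuttering that relative motion was chosen to avoid.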

As an alternative to using LipSick to generate the driving video for LivePortrait, the authors of LivePortrait used FaceFormer together with Whisper for audio-driven results. Might be worth a shot for highly expressive results and fast inference times?

The most successful LivePortrait configuration I have found so far for lip syncing is the following:

self.inference_cfg = InferenceConfig(
    flag_crop_driving_video=True,
    flag_normalize_lip=True,
    flag_use_half_precision=False,
    flag_eye_retargeting=False,
    flag_lip_retargeting=True,
    flag_pasteback=True,
    flag_stitching=True,
    flag_relative_motion=True,
    flag_do_rot=False,
    flag_video_editing_head_rotation=False,
    flag_source_video_eye_retargeting=False,
    flag_do_crop=True,
)
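
For completeness, this is roughly how I wire that config into the pipeline. It follows the structure of the repo's inference.py, but the exact argument field names vary between versions, so treat it as a sketch:

from src.config.argument_config import ArgumentConfig
from src.config.crop_config import CropConfig
from src.live_portrait_pipeline import LivePortraitPipeline

# Point the source at the portrait and the driving input at the LipSick output.
# Depending on the repo version the fields are source/driving or source_image/driving_info.
args = ArgumentConfig()
args.source = "portrait.jpg"
args.driving = "lipsick_output.mp4"

pipeline = LivePortraitPipeline(
    inference_cfg=self.inference_cfg,  # the settings above
    crop_cfg=CropConfig(),             # default crop settings
)
pipeline.execute(args)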

I wanted to start this thread so we can pitch in together and get audio-driven editing to work really well!

Inferencer commented 2 months ago

It depends on the method and on whether you need to drive an image or a video. When driving an image you will need head motion, so FaceFormer wouldn't be a good fit; when driving a video, however, it would be ideal to have a static audio-driven animation like FaceFormer.

I'm moving out of the AI/CV field so I won't be much help, as I am now a month behind on my daily research. A quick tip for finding SOTA or recent open-source work, though: search arXiv, click "search all fields", and search for "lip"; once you've checked all of those, search "avatar". A quick search just now turned up 20 papers I haven't looked at, such as UniTalker, which has an excellent parameter-based audio-to-3DMM model. You will normally see a link to a project page or GitHub repo in the arXiv description by clicking the "more" button at the end of the description, and sometimes you will need to open the PDF and Ctrl+F "git" to see if there is a link hidden in the paper.
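
If you'd rather script that search than click through the site, here is a quick sketch using the third-party arxiv package (pip install arxiv); swap the query for "avatar" on a second pass:

import arxiv  # third-party client for the arXiv API: pip install arxiv

# Mirror the manual search above: recent submissions matching "lip".
search = arxiv.Search(
    query="lip",
    max_results=20,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for result in arxiv.Client().results(search):
    # The abstract/summary is usually where a project page or GitHub link hides.
    print(result.entry_id, "-", result.title)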

Arvrairobo commented 2 months ago

We are also looking for a similar feature, @danablend. By any chance, were you able to achieve an audio-driven portrait? I would love to see your code and help you build or finetune it. Do let me know.

UltraClr commented 2 months ago

Could you show some of the better videos, as well as the generation speed (using LipSick)?

Arvrairobo commented 2 months ago

same here, waiting for it @danablend

UltraClr commented 2 months ago

@cleardusk Thanks for your team's excellent work! Could you tell me how to control the lip features? Sometimes an opening/closing ratio of 0 can make the lips look weird. Do you have a good way to keep the lips properly still in the vid2vid setting when the driving video is not moving?

huipengo commented 2 months ago

LipSick library with LivePortrait

Can you share your project code? Thank you!