KwaiVGI / LivePortrait

Bring portraits to life!
https://liveportrait.github.io

An interesting result in v2v mode #215

Open iloveOREO opened 1 month ago

iloveOREO commented 1 month ago

When using v2v for expression driving with the same video as both source and driving input, I observed that the results show 'exaggerated expressions' (the mouth opens wider or closes less than in the driving video). Shouldn't the output be exactly the same as the driving video?

https://github.com/user-attachments/assets/4366169f-fd45-4a69-ab2e-d81a93ee55d3

cleardusk commented 1 month ago

Thanks for your feedback @iloveOREO

We have updated the main branch to fix a small bug, and the last frame no longer has pursed lips.

We will post more details about this phenomenon in this issue tomorrow.

Mystery099 commented 1 month ago

Thanks for your feedback. @iloveOREO

If the source video and driving video are the same video and you want the animated video to be as similar as possible to the source video, you can run `python inference.py --no_flag_relative_motion --no_flag_do_crop`. With this setting you can achieve the following result:

https://github.com/user-attachments/assets/c1e73ee1-e151-41f8-833f-ffcbb2fa3ef8

Here we are using absolute driving rather than relative driving. The difference is that with `--flag_relative_motion`, the motion offset of the current driving frame relative to the first driving frame is added to the motion of the source frame to form the final driving motion, whereas with `--no_flag_relative_motion`, the motion of the current driving frame is used directly as the final driving motion.
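A minimal sketch of the two modes, based on the description above (the names `x_s_info`, `x_d_0_info`, and `x_d_i_info` follow the ones used later in this thread; the actual pipeline also handles rotation, translation, and scale):

```python
# Sketch only: how the final expression offset is composed in the two driving modes.
# Each *_info is assumed to be a dict with an 'exp' tensor (expression keypoint
# offsets) extracted from the source frame, the first driving frame, and the
# i-th driving frame.

def compose_expression(x_s_info, x_d_0_info, x_d_i_info, relative_motion=True):
    if relative_motion:
        # --flag_relative_motion: add the i-th driving frame's offset relative to the
        # first driving frame on top of the source frame's own expression.
        return x_s_info['exp'] + (x_d_i_info['exp'] - x_d_0_info['exp'])
    # --no_flag_relative_motion: use the i-th driving frame's expression directly.
    return x_d_i_info['exp']
```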

With the default `--flag_relative_motion`, if the source frame is already a smile and the driving frame has an expression deformation (another smile) relative to the first driving frame, the expression of the animated frame is effectively a smile added on top of a smile, so the expression is amplified. The animated video in this setting is as follows:

https://github.com/user-attachments/assets/7d36f137-cca5-4945-935b-242f240e3f56
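To put the amplification in symbols (a rough sketch, writing $\delta$ for the expression offsets and assuming v2v animates the source video frame by frame):

$$
\delta_{\mathrm{anim},i} = \delta_{s,i} + \left(\delta_{d,i} - \delta_{d,0}\right)
$$

When the source and driving clips are the same video, $\delta_{s,i} = \delta_{d,i}$, so $\delta_{\mathrm{anim},i} - \delta_{d,0} = 2\,(\delta_{d,i} - \delta_{d,0})$: the deviation from the first frame is doubled, which matches the exaggerated mouth motion reported above. With `--no_flag_relative_motion`, the animated expression is simply $\delta_{d,i}$, so no amplification occurs.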

iloveOREO commented 1 month ago


Thank you for your reply. 'Absolute driving' performs well in this case. However, I also tried generating with different videos/IDs and found that there is always some jitter when using `--no_flag_relative_motion` and no relative head rotation (v2v).

https://github.com/user-attachments/assets/b91c2c69-70bb-443f-bfb5-f1b0b0f1ac1e

Initially, I thought this was caused by `t_new = x_d_i_info['t']`, so I tried changing it to `t_new = x_s_info['t']` (since `R_new = R_s`, shouldn't this be the case?), but the results didn't change significantly. Finally, I tried setting `t_new = torch.zeros(x_d_i_info['t'].size()).to(device)` and found no visible difference in the generated results. So, is the main source of head jitter `x_d_i_info['exp']`?

https://github.com/user-attachments/assets/a53271f8-9eac-44a0-bb3d-764291562a2b
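For reference, a minimal sketch of the translation variants described above (names follow the thread; the real pipeline code may differ):

```python
import torch

# Sketch of the t_new variants tried above. x_s_info / x_d_i_info are assumed to be
# motion dicts of the source frame and the i-th driving frame, each with a 't' tensor.
def pick_t_new(x_s_info, x_d_i_info, variant='driving', device='cuda'):
    if variant == 'driving':   # original: t_new = x_d_i_info['t']
        return x_d_i_info['t']
    if variant == 'source':    # since R_new = R_s, reuse the source translation
        return x_s_info['t']
    if variant == 'zero':      # t_new = torch.zeros(...): reported to look the same
        return torch.zeros(x_d_i_info['t'].size()).to(device)
    raise ValueError(f"unknown variant: {variant}")
```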

How can real 'absolute driving' be achieved, where only the expression is edited and the original head movement is retained? Additionally, I noticed that the paper specifically mentions: "Note that the transformation differs from the scale orthographic projection, which is formulated as x = s · (x_c + δ)R + t." Could the current representation be causing instability in the generated results under the driving video, due to the inability to fully decouple `exp` from `R`?
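For context, the two formulations being contrasted are (the second is the quoted scale orthographic projection; the first is my reading of the transformation in the LivePortrait paper, so treat it as an assumption):

$$
\text{LivePortrait:}\;\; x = s \cdot (x_c R + \delta) + t
\qquad
\text{scale orthographic projection:}\;\; x = s \cdot (x_c + \delta)\, R + t
$$

In the first form the expression offset $\delta$ is added after the canonical keypoints $x_c$ are rotated, so $\delta$ lives in the posed space; in the second form $\delta$ is applied in canonical space and rotated together with the head by $R$. The question about decoupling `exp` from `R` appears to hinge on this difference.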