OpenTalker / DPE

[CVPR 2023] DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
https://carlyx.github.io/DPE/
MIT License

Different from LIA #2

Closed KhalilWong closed 1 year ago

KhalilWong commented 1 year ago

In LIA, the "output video" contains the same proportion of body and head as the "source image". But in DPE, the proportion of body and head in the "output video" appears to be determined by the "driving video", similar to a cropping process.

Is this result caused by the pre-trained model? Can I control the percentage of cropping?

Carlyx commented 1 year ago

In fact, due to absolute driving, the face proportion after the pose-driving stage is close to that of the driving image. Expression transfer does not have this problem. A theoretical solution is to make the face proportions of the driving image and the source image close during data processing via a padding operation (DPE has no special requirements for face cropping, so this can be adjusted as needed).
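The padding idea above can be sketched as follows. This is a minimal, hypothetical numpy helper (not part of the DPE codebase): given the face height reported by some external face detector, it pads the source frame top/bottom so the face-to-frame height ratio matches a target ratio taken from the driving frame.

```python
import numpy as np

def pad_to_face_ratio(img, face_h, target_ratio):
    """Pad an H x W x 3 image (top/bottom, edge-replicated) so that
    face_h / padded_height is approximately target_ratio.

    face_h and target_ratio are assumed to come from an external
    face detector run on the source and driving frames (hypothetical).
    """
    h = img.shape[0]
    needed_h = int(round(face_h / target_ratio))
    if needed_h <= h:
        # Face already occupies a smaller proportion than the target;
        # cropping (not padding) would be needed, which we skip here.
        return img
    pad_total = needed_h - h
    top = pad_total // 2
    bottom = pad_total - top
    return np.pad(img, ((top, bottom), (0, 0), (0, 0)), mode="edge")
```

Replicating edge pixels is only a placeholder; in practice the padded region is exactly the area the thread below notes is hard to restore, so padding before driving (rather than inpainting after) is the point of the suggestion.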

DPE focuses on the transfer of expression, so we do not specifically deal with the proportion problem of pose driving.

KhalilWong commented 1 year ago

Yes, that's true. I tested the case where the proportion of the body in the driving image is smaller than in the source image: in pose-DRIVING mode, the source image is cropped according to the driving image; in expression-DRIVING mode, the source image appears to be cropped and then automatically padded, like this:

[attached: first frames of d.mp4, edit.mp4, and s.mp4]

The bottom padded region is difficult to restore with GFPGAN or other models. This means that even when the face proportions of the driving image and the source image are CLOSE, the image quality of the bottom part is not very good. The quality of the face region does not seem that good either.

I have two questions. First, I'm curious what causes this behavior that differs from LIA. Second, as I see on your project page, all output videos in the video EDITING cases have a large body proportion. Do I need to wait for a pre-trained video-editing model to be released if I want an output video with a large body proportion?

Carlyx commented 1 year ago

The links are broken and I can't see the results.

2 Video editing

There are three steps for editing:
(1) Data processing: given a full-size video (F), e.g. one containing the entire body, we use a face detection method to crop the faces and obtain the input video (A). For the input video, the face proportions are similar to those used in other methods.
(2) Face driving: we feed A into DPE and perform expression transfer, obtaining the result (B).
(3) Pasting back: we paste B back into F.

In fact, DPE achieves independent control over the expression, which makes it easier to paste the result back into the full-size video. The pre-trained model is for step (2). We will upload the code for cropping (1) and pasting (3) later.
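The crop/paste steps (1) and (3) can be sketched roughly as below. This is an illustrative numpy sketch, not the released DPE code: the face box is assumed to come from an external detector, and the seam blending here is a simple linear feather, whereas a real implementation would likely use a proper blending mask.

```python
import numpy as np

def crop_face(frame, box):
    """Step (1): crop the face region given a detector box (x0, y0, x1, y1).
    The face detector itself is external (hypothetical)."""
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1].copy()

def paste_back(full_frame, edited_face, box):
    """Step (3): paste the expression-edited crop back into the full-size
    frame, feathering the seam so the border blends into the original."""
    x0, y0, x1, y1 = box
    out = full_frame.copy()
    face = edited_face.astype(np.float32)
    orig = out[y0:y1, x0:x1].astype(np.float32)
    h, w = face.shape[:2]
    # Feather mask: 1 in the center, ramping toward 0 at the crop border.
    ramp = max(min(h, w) // 8, 1)
    mask = np.ones((h, w), np.float32)
    for i in range(ramp):
        a = (i + 1) / (ramp + 1)
        mask[i, :] = np.minimum(mask[i, :], a)
        mask[h - 1 - i, :] = np.minimum(mask[h - 1 - i, :], a)
        mask[:, i] = np.minimum(mask[:, i], a)
        mask[:, w - 1 - i] = np.minimum(mask[:, w - 1 - i], a)
    mask = mask[..., None]
    out[y0:y1, x0:x1] = (mask * face + (1 - mask) * orig).astype(full_frame.dtype)
    return out
```

Because DPE only changes the expression (step 2), the crop's pose and outline stay aligned with the full frame, which is exactly what makes this paste-back feasible.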

1 Face proportion

In practice, we observe that pose transfer is much easier to achieve than expression transfer, and there is no strong supervision in our decoupling framework. Therefore, the model tends to learn the face proportions of the driving image during pose transfer. In contrast, expression transfer preserves proportions better. BTW, video editing is only for the editing of expression.

KhalilWong commented 1 year ago

Thanks for your patience. I'll try following these steps, although pasting back may introduce some minor issues such as misalignment.

Reuploaded the 3 pics: [attached: first frames of d.mp4, edit.mp4 (marked), and s.mp4]

Carlyx commented 1 year ago

I will upload the pasting code ASAP. The face proportions in this example are small compared to the VOX dataset we used for training, which may be a cause of artifacts.

KhalilWong commented 1 year ago

Agree and thanks!