andrerochow / fsrt

Official implementation of the CVPR 2024 paper "FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features"
https://andrerochow.github.io/fsrt

Great work! #2

Closed LeslieZhoa closed 1 month ago

LeslieZhoa commented 1 month ago

This is very good work! I want to know how to make the ID preservation better?

Inferencer commented 1 month ago

I thought the fidelity was on point for a one-shot approach. Are you having a different experience with in-the-wild images?

harisreedhar commented 1 month ago

> I thought the fidelity was on point for a one-shot approach. Are you having a different experience with in-the-wild images?

No, fidelity is a little off; it can look like a different person. But the paper mentions that using multiple source images can increase fidelity.

source:

[source image]

result:

https://github.com/andrerochow/fsrt/assets/46858047/3240888b-c70a-424b-b6c9-d970efceb320

Inferencer commented 1 month ago

I haven't read the paper yet, so I'll be interested in the multiple-source approach you mentioned. I wonder whether it would help if the starting driving frame were in a similar position to the source image. I'm going to attempt to drive it with a speech-driven 3DMM.

johndpope commented 1 month ago

Maybe you can use the new upscaler by @hamadichihaoui to great effect: https://github.com/hamadichihaoui/BIRD. This was trained on VoxCeleb2 at 256px; that original image of yours looks like 1024px, so it gets degraded down. This codebase only operates at 256px. Maybe @andrerochow can share the training scripts?

(Side note: I see from a config file that this is using keypoints. The latest research from Microsoft suggests just using ResNets, https://www.microsoft.com/en-us/research/project/vasa-1/ ; that work is based on MegaPortraits, which I attempt to recreate here (with some success): https://github.com/johndpope/MegaPortrait-hack )

Inferencer commented 1 month ago

> Maybe you can use the new upscaler by @hamadichihaoui to great effect: https://github.com/hamadichihaoui/BIRD. This was trained on VoxCeleb2 at 256px; that original image of yours looks like 1024px, so it gets degraded down. This codebase only operates at 256px. Maybe @andrerochow can share the training scripts?
>
> (Side note: I see from a config file that this is using keypoints. The latest research from Microsoft suggests just using ResNets, https://www.microsoft.com/en-us/research/project/vasa-1/ ; that work is based on MegaPortraits, which I attempt to recreate here (with some success): https://github.com/johndpope/MegaPortrait-hack )

John, you are everywhere, good to see you. Yes, it's VoxCeleb 256; it needs to be trained on a better dataset, to be honest, but it's good enough for demo purposes. The upcoming portrait-4dv2 and invertavatar also look good, but this is what we have for now without jittering or the need for feature similarity in cross-identity usage. The relative motion transfer is also a really cool feature.

I haven't researched the latest upscalers in nearly a year, as I default to GFPGAN and CodeFormer, but that one (BIRD) looks really good, thanks for sharing.
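For reference, a minimal sketch of how a face restorer such as GFPGAN could be applied to FSRT's 256px output frames (the checkpoint path and frame filenames are assumptions; BIRD or CodeFormer would slot in at the same point in the pipeline):

```python
# Hedged sketch: upscaling a 256px FSRT output frame with GFPGAN.
# The checkpoint path and frame I/O below are assumptions, not part of this repo.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path='GFPGANv1.4.pth',  # assumed local checkpoint
    upscale=4,                    # 256 -> 1024
    arch='clean',
    channel_multiplier=2,
)

frame = cv2.imread('fsrt_output_frame.png')  # one 256x256 result frame
_, _, restored = restorer.enhance(
    frame, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite('fsrt_output_frame_x4.png', restored)
```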

I'll see what I can do with this repo. I'm going to throw it up on Hugging Face ZeroGPU in a bit, do my main optimizations in this repo, SickFace, and then merge with LipSick.

Inferencer commented 1 month ago

This is the Hugging Face space. It works, but it needs more optimizations and more text, like the FSRT links & acknowledgments, etc.: https://huggingface.co/spaces/Inferencer/SickFace

andrerochow commented 1 month ago

Hi there!

If you want to maximize ID preservation, you should try to animate with relative motion transfer. Absolute motion transfer often leads to shape deformation, which reduces ID preservation.
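To make the distinction concrete, here is a rough FOMM-style sketch of relative vs. absolute motion transfer (illustrative only, not FSRT's actual code; the keypoint shapes and the `relative` switch are assumptions):

```python
# Illustrative sketch (not FSRT's code): relative vs. absolute motion transfer
# for keypoint-driven face reenactment.
import numpy as np

def transfer_motion(kp_source, kp_driving, kp_driving_initial, relative=True):
    """Return the keypoints used to animate the source identity.

    kp_source          : keypoints detected on the source image
    kp_driving         : keypoints of the current driving frame
    kp_driving_initial : keypoints of the first driving frame
    """
    if relative:
        # Apply only the change in motion since the first driving frame,
        # so the source face keeps its own geometry (better ID preservation).
        return kp_source + (kp_driving - kp_driving_initial)
    # Absolute transfer copies the driving geometry directly, which can
    # deform the source face shape toward the driving identity.
    return kp_driving

# With an identical first frame, relative transfer leaves the source untouched.
kp_src = np.random.rand(10, 2)
kp_drv0 = np.random.rand(10, 2)
assert np.allclose(transfer_motion(kp_src, kp_drv0, kp_drv0), kp_src)
```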

johndpope commented 1 month ago

Nice work, Andre. I see this paper referenced by this one: https://arxiv.org/pdf/2405.16204#page=2.17

Inferencer commented 1 month ago

I looked into using multiple source images: using 3 source images increased the inference time by 4x. The resulting files are below for comparison, where the left is 1 source and the right is 3 sources; there is clear shine removal from the hair and some loss of nose fidelity with 3 sources.

https://github.com/andrerochow/fsrt/assets/121839197/cf10f4e0-47af-4673-9cd4-154d571f7ce0

andrerochow commented 1 month ago

> I looked into using multiple source images: using 3 source images increased the inference time by 4x. The resulting files are below for comparison, where the left is 1 source and the right is 3 sources; there is clear shine removal from the hair and some loss of nose fidelity with 3 sources.
>
> 1v3.mp4

If you want to use multiple source images, you cannot use the checkpoint of a model trained with only one source image. As mentioned in the paper, only models trained with at least two source images generalize to a flexible number of source images during inference.

The checkpoint of our model trained with two source images (vox256_2Source.pt) is now also available for download.
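As a rough illustration of the workflow (the animation call itself depends on the repo's demo script and is left as a placeholder; filenames are assumptions), one would load vox256_2Source.pt and pass a list of source frames instead of a single image:

```python
# Illustrative only: loading the released two-source checkpoint and preparing
# several source frames. The actual animation entry point lives in the repo's
# demo script and is not reproduced here.
import torch
import imageio

checkpoint = torch.load('vox256_2Source.pt', map_location='cpu')  # released 2-source model
print(list(checkpoint.keys()))  # inspect what the checkpoint bundles

# Any number of source frames can be prepared; per the comment above, the model
# only generalizes to this if it was trained with >= 2 source images.
source_frames = [imageio.imread(p) for p in
                 ('source_a.png', 'source_b.png', 'source_c.png')]

# frames = animate(checkpoint, source_frames, 'driving.mp4')  # placeholder call
```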