filby89 / spectre

Official Pytorch Implementation of SPECTRE: Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
235 stars 21 forks

Unable to perceive improvement. #17

Open daikankan opened 10 months ago

daikankan commented 10 months ago

Thanks for sharing this work, it offers good insight and is inspiring. However, I'm unable to perceive any improvement from the pretrained model. My inference with E_expression:

- For images, input: I just concatenate the same image 5 times (1, 5, 224, 224); output: FLAME parameters (5, 53), from which I choose the center frame's parameters, output[2, :].
- For videos, input: I just concatenate the same frame 5 times (1, 5, 224, 224); output: FLAME parameters (5, 53), from which I choose the center frame's parameters, output[2, :].

Or maybe I should try concatenating 5 continuous frames as the E_expression input? My mean_shape for alignment is consistent with the author's. Below is a comparison of the results (talking-head videos and single-image reconstruction) between E_flame without E_expression and E_flame with E_expression:

E_flame_without_E_expression:

https://github.com/filby89/spectre/assets/20749514/84449ac3-c6a9-4d59-85ee-5267f97166a8

msk_E_flame_without_E_expression

obm_E_flame_without_E_expression

E_flame_with_expression:

https://github.com/filby89/spectre/assets/20749514/3d44a0e7-2a9b-4ab6-8d84-29eacf3ff03d

msk_E_flame_with_E_expression

obm_E_flame_with_E_expression

Sorry, my test may not be sufficient, and my preprocessing may not be accurate.
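For reference, the two input strategies described above (repeating one frame vs. sliding a window of 5 consecutive frames and keeping the center prediction) could be sketched roughly as follows. `DummyExpressionEncoder` is a hypothetical stand-in for SPECTRE's E_expression, not the actual model:

```python
import torch

# Hypothetical stand-in for SPECTRE's E_expression: maps a window of T face
# crops to FLAME expression + jaw parameters (53 per frame).
class DummyExpressionEncoder(torch.nn.Module):
    def forward(self, x):                     # x: (B, T, 3, 224, 224)
        b, t = x.shape[:2]
        return torch.zeros(b, t, 53)          # (B, T, 53)

encoder = DummyExpressionEncoder()
video = torch.randn(100, 3, 224, 224)         # 100 aligned face crops

# Slide a 5-frame window over the video and keep only the center prediction,
# instead of repeating one frame 5 times (which removes all temporal context).
params = []
for i in range(2, len(video) - 2):
    window = video[i - 2:i + 3].unsqueeze(0)  # (1, 5, 3, 224, 224)
    out = encoder(window)                     # (1, 5, 53)
    params.append(out[0, 2])                  # center frame's parameters
params = torch.stack(params)                  # (96, 53)
```

With a temporal model, the sliding-window variant lets the encoder see motion context around each frame, which a repeated single frame cannot provide.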

filby89 commented 7 months ago

Hey, thanks for your interest and for bringing this up. Generally, SPECTRE is trained on videos of a human talking, using a perceptual lipread loss between the original video and the rendered video. The lipread loss improves the perception of speech from the output 3D mesh. Note, however, that the perception of speech is not captured by the ~18 2D mouth landmarks you show here. This is an important reason why methods that score lower landmark-placement error are not necessarily better in terms of human perception (geometric errors do not correlate with human perception).
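The perceptual lipread loss described above could be sketched along these lines. The frozen `lipreader` here is a toy stand-in for a pretrained lipreading feature extractor, not SPECTRE's actual model, and the 88x88 mouth-crop size is an assumption:

```python
import torch
import torch.nn.functional as F

# Frozen stand-in for a pretrained lipreading network (hypothetical, for
# illustration only): maps 88x88 mouth crops to a 128-d feature vector.
lipreader = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.Linear(88 * 88, 128))
for p in lipreader.parameters():
    p.requires_grad_(False)

real_mouth = torch.rand(5, 88, 88)      # mouth crops from the original video
rendered_mouth = torch.rand(5, 88, 88)  # mouth crops from the rendered mesh

# Perceptual loss: distance between lipreading features of the real and the
# rendered mouth regions, averaged over the 5-frame window.
f_real = lipreader(real_mouth)
f_rend = lipreader(rendered_mouth)
loss = (1.0 - F.cosine_similarity(f_rend, f_real, dim=-1)).mean()
```

Because the loss is computed in a lipreading feature space rather than on landmark coordinates, it can pull the mesh toward mouth shapes that read as the correct speech even when landmark error increases.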

A better way to compare SPECTRE with another method would be to render the output 3D mesh in a video and compare the two visually.

Also, a final note: in some cases the lipread loss will even exaggerate the mouth a bit (e.g. add more protrusion and roundedness than is visible) in order to better capture the perception of speech, which results in even worse landmark placement compared to other methods.

agupta54 commented 6 months ago

Hi @daikankan, can you please explain how you are pasting the rendered avatar back into the video?

daikankan commented 6 months ago

@agupta54

Just OpenCV: circle, rectangle, and putText.