EvelynFan / FaceFormer

[CVPR 2022] FaceFormer: Speech-Driven 3D Facial Animation with Transformers

@JSHZT @shivangi-aneja @xiaodongyichuan Also facing the same issue. Did you fix this problem? #98

Closed JSHZT closed 8 months ago

JSHZT commented 8 months ago
          @JSHZT @shivangi-aneja @xiaodongyichuan Also facing the same issue. Did you fix this problem?

Originally posted by @Shirley-0708 in https://github.com/EvelynFan/FaceFormer/issues/43#issuecomment-1777044199

In fact, the modifications described above are not rigorous. I don't agree with the operation of concatenating a zero vector, because the entire sequence has already been expressed relative to the template. Following the author's original idea, the network learns displacements relative to a specific template, so after concatenating the template and then subtracting the template, the first frame is a neutral expression with zero displacement. Concatenating a zero vector instead breaks the logic of the task. The way I achieved a similar goal was to redo the data, which is undoubtedly expensive, but beyond that I cannot think of another rigorous method, because in this data and task, identity and style are tightly coupled. I hope this helps!
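A minimal sketch of the distinction being argued here, assuming the displacement-from-template setup described above and reading the "zero vector" as a zero vertex frame; the function and tensor names are illustrative, not the repository's actual code:

```python
import torch

def to_displacements(vertices, template):
    """vertices: (T, V*3) absolute positions; template: (V*3,) neutral face.
    Targets are per-frame displacements from the subject's template."""
    return vertices - template

def prepend_template_frame(displacements, template):
    """Prepend the template as the start frame: its displacement is exactly
    zero after the subtraction, i.e. a genuinely neutral first frame."""
    neutral_disp = (template - template).unsqueeze(0)  # all zeros by construction
    return torch.cat([neutral_disp, displacements], dim=0)

def prepend_zero_vertex_frame(displacements, template):
    """Prepend a raw zero vertex frame instead: after subtracting the template,
    its displacement is -template, i.e. a face collapsed to the origin, which
    is the inconsistency pointed out above."""
    collapsed_disp = (torch.zeros_like(template) - template).unsqueeze(0)
    return torch.cat([collapsed_disp, displacements], dim=0)
```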

Shirley-0708 commented 8 months ago

@JSHZT Thanks for your prompt reply! Actually, I'm more concerned about why the hidden_states are all identical after the audio encoder at prediction time, which makes the predicted animation stay stationary in every frame even without the zero-vector problem. I've observed that the hidden_states clearly differ during training. Could this be explained by the model not having learned well in the training phase, so that when it sees unseen data at prediction time the hidden_states collapse to nearly the same values? My custom dataset is Chinese, and I've even wondered whether wav2vec2 has trouble extracting features from Chinese audio. This issue has troubled me for a long time; I look forward to your help! As a side note, I am trying to learn from 3DMM coefficients and a Chinese audio dataset.

[screenshot attached]
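One quick way to check whether the encoder output is really constant at inference is to measure how much the hidden states vary over time for a held-out clip. A minimal diagnostic sketch, using the stock HuggingFace Wav2Vec2Model rather than FaceFormer's modified encoder, with a placeholder checkpoint name and file path:

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder checkpoint; for Mandarin speech a wav2vec2 model pretrained on
# Chinese audio may be a better fit than this English-only checkpoint.
model_name = "facebook/wav2vec2-base-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

# Hypothetical test clip, assumed to be 16 kHz mono.
speech, sr = sf.read("test_clip.wav")
inputs = feature_extractor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(inputs.input_values).last_hidden_state  # (1, T, 768) for the base model

# If the encoder output barely changes over time, any decoder conditioned on it
# will produce a near-static animation.
temporal_std = hidden_states.squeeze(0).std(dim=0).mean().item()
frame_diff = (hidden_states[0, 1:] - hidden_states[0, :-1]).norm(dim=-1).mean().item()
print(f"mean temporal std of hidden states: {temporal_std:.6f}")
print(f"mean frame-to-frame difference norm: {frame_diff:.6f}")
```

If these numbers are near zero for the Chinese clips but clearly nonzero for, say, a VOCASET clip run through the same pipeline, that points to a mismatch between the pretrained checkpoint and the input audio (or its preprocessing) rather than to the decoder.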