DinoMan / speech-driven-animation


About emotion expression. #56

Closed: KevinZzz35 closed this issue 3 years ago

KevinZzz35 commented 3 years ago

This is really nice work! I just have a question about the emotion part. How does the model guarantee that the emotion is correct? I did not see any discriminator for emotion. Did I miss something?

DinoMan commented 3 years ago

No, you didn't miss anything and there is no guarantee that emotion will be captured. We did not set out to model this but noticed that when we train on emotional speech the model is capable of capturing and reflecting the emotion.

KevinZzz35 commented 3 years ago

So, the model can capture a happy emotion, for example, from happy speech and reflect it on the predicted faces, and the model was trained on a multi-emotion dataset, not just on a happy-emotion dataset. Is that correct?

DinoMan commented 3 years ago

Yes, when trained on the CREMA-D dataset the model observes multiple emotions and seems to learn the correlation between the facial expression of the actors and the emotion in their speech. We have not evaluated this beyond visual assessment in the paper although it is certainly possible and would have been a nice addition.

KevinZzz35 commented 3 years ago

I mean, the L1 loss guarantees that the bottom half of the face is recovered correctly, the sequence discriminator guarantees recovering consecutive frames, and the synchronization discriminator is for synchronization. No part guarantees that the emotion is correct, yet the model can still predict frames with the correct emotion. I am quite confused about this part.

DinoMan commented 3 years ago

You are correct that there is no explicit loss guaranteeing the emotion, but there are some losses that help collectively. For example, the L1 loss recovers the bottom half of the face, which is bound to capture some effect of the emotion (i.e. its effect on the bottom half of the face). The sequence discriminator doesn't just guarantee recovering consecutive frames (if that were all, a 3D CNN discriminator applied over a few consecutive frames could be used instead). The sequence discriminator witnesses the entire sequence and encourages the production of expressions (they might not always be correct but they have to be there). These expressions must a) be consistent throughout the video (i.e. no unrealistically short blinks, no inconsistent emotions) and b) look realistic. Finally, the synchronization discriminator is designed specifically for synchronization but might also help a bit with emotion. It could be that uncorrelated emotion between the video and audio is a clear telltale sign of a fake (unsynchronized) video, so the discriminator will use this information to tell them apart, forcing the generator to learn to better reflect emotions to some extent.

I believe the fact that the L1 loss and sync discriminator capture some emotional information, in combination with the sequence discriminator forcing emotional expressions to look real and coherent (throughout the entire clip as well as across the face), could be the reason for the network's ability to capture emotion. As a final note, this paper shows that it is possible to capture emotions from speech and reflect them onto the face, but emotional speech is not the main focus (which is why we haven't evaluated it as thoroughly), and I believe it is something that could be improved in the future.
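
For illustration, here is a minimal sketch of how these terms might be combined into the generator objective. The function names, loss weights, and tensor shapes are hypothetical and not the exact training code of this repository:

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_frames, real_frames, audio, seq_disc, sync_disc,
                   l1_weight=50.0, seq_weight=1.0, sync_weight=1.0):
    # Frames assumed to be shaped (..., H, W); weights are illustrative only.

    # L1 reconstruction on the lower half of the face: this alone already
    # captures part of the emotion's effect on the mouth/jaw region.
    h = fake_frames.shape[-2]
    l1 = F.l1_loss(fake_frames[..., h // 2:, :],
                   real_frames[..., h // 2:, :])

    # Sequence discriminator sees the whole generated clip, pushing the
    # generator towards expressions that look realistic and stay temporally
    # consistent across the entire video.
    seq_logits = seq_disc(fake_frames)
    seq_adv = F.binary_cross_entropy_with_logits(
        seq_logits, torch.ones_like(seq_logits))

    # Synchronization discriminator compares audio and video; mismatched
    # emotion between the two can be one cue it uses to spot fakes, which
    # indirectly nudges the generator to reflect the emotion in the speech.
    sync_logits = sync_disc(fake_frames, audio)
    sync_adv = F.binary_cross_entropy_with_logits(
        sync_logits, torch.ones_like(sync_logits))

    return l1_weight * l1 + seq_weight * seq_adv + sync_weight * sync_adv
```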

KevinZzz35 commented 3 years ago

I got it. I ran the demo on my test data and the emotion is not good, and I did not find this part in the paper. Thank you very much!