eeskimez / emotalkingface

The code for the paper "Speech Driven Talking Face Generation from a Single Image and an Emotion Condition"
MIT License

How to better preserve identity? #2

Open · loboere opened this issue 2 years ago

loboere commented 2 years ago

When I use an image of my own, it gets heavily distorted and does not look like the original face. Is there any parameter to improve the identity?

yzyouzhang commented 2 years ago

I am not sure whether by "distorted" you mean a resolution issue. We trained our model at a low resolution (128x128) due to the limitations of the dataset and our computation resources. You can try modifying the resolution and retraining the model, or you could try some super-resolution methods. To improve identity preservation, you could try re-weighting the losses. Thanks.
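As an illustration of the super-resolution suggestion above, here is a minimal sketch that upscales each 128x128 output frame by 4x. It assumes opencv-contrib-python and a pretrained ESPCN model file downloaded separately; the file paths are placeholders.

```python
import cv2

# Requires opencv-contrib-python and a pretrained super-resolution model,
# e.g. ESPCN_x4.pb from the OpenCV samples (downloaded separately).
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("ESPCN_x4.pb")      # placeholder path to the model file
sr.setModel("espcn", 4)          # 4x upscaling: 128x128 -> 512x512

cap = cv2.VideoCapture("generated.mp4")  # placeholder: the model's output video
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("generated_sr.mp4", fourcc,
                      cap.get(cv2.CAP_PROP_FPS), (512, 512))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(sr.upsample(frame))        # upscale each generated frame

cap.release()
out.release()
```

This only post-processes the generated video; it sharpens the frames but cannot recover identity detail the generator never produced.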

ujjawalcse commented 1 year ago

Yes, I also tried this repo and noticed the loss of identity. The character in the resulting video doesn't look like the reference image we provide as input.

What do you mean by re-weighting the losses, @yzyouzhang? Thanks.

yzyouzhang commented 1 year ago

Hi, I mean increasing the weight of the identity loss in the total loss. Could you please also describe the issue in a bit more detail or share some samples? How different is the generated image from the reference? Thanks.
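For concreteness, here is a minimal sketch of what a weighted total loss looks like. The term names and weight values below are hypothetical placeholders, not the exact terms or defaults used in this repo; the point is simply that raising the coefficient on the identity term makes the optimizer prioritize matching the reference face.

```python
# Hypothetical weights for the individual loss terms -- the actual term names
# and default values in this repo may differ.
lambda_recon = 1.0   # pixel / reconstruction loss
lambda_id    = 5.0   # identity loss (raised from a smaller default)
lambda_emo   = 1.0   # emotion condition loss
lambda_adv   = 0.1   # adversarial loss

def total_loss(loss_recon, loss_id, loss_emo, loss_adv):
    """Weighted sum of the individual loss terms (illustrative only)."""
    return (lambda_recon * loss_recon
            + lambda_id * loss_id
            + lambda_emo * loss_emo
            + lambda_adv * loss_adv)
```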

ujjawalcse commented 1 year ago

Yes, sure. Here is the results.zip file, which contains condition.png (the reference image) and the output videos with different expressions.

results.zip

eeskimez commented 1 year ago

Are you using our pre-trained model? If so, it is trained on the CREMA-D dataset, which has a limited number of samples and a narrow data distribution, so it might not generalize well to images outside of CREMA-D. If you want it to generalize well, you might need a more extensive dataset, which is hard to find since we require emotion labels. You can try omitting the emotion labels and using the LRW dataset for better generalization.

yzyouzhang commented 1 year ago

I checked your output videos and found that the speech is much longer than the utterances in our training data (~2 s). The quality of the first two seconds is reasonable. This might suggest that the generalization ability needs to be improved, as Emre said.
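One possible workaround suggested by this observation (not something the authors state above) is to split a long driving audio into roughly 2-second chunks, run the model per chunk, and concatenate the resulting videos. A minimal sketch of the splitting step, assuming soundfile is installed and using placeholder paths:

```python
import soundfile as sf

# Split a long driving audio file into ~2-second chunks so each one matches
# the clip length the model was trained on (path and chunk length are placeholders).
audio, sample_rate = sf.read("speech.wav")
chunk_len = 2 * sample_rate  # ~2 seconds of samples

for i in range(0, len(audio), chunk_len):
    sf.write(f"speech_chunk_{i // chunk_len:03d}.wav",
             audio[i:i + chunk_len], sample_rate)
```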

For the dataset, there is a new emotional talking face dataset released after our publication, called MEAD. Feel free to try training our model on that dataset. I am also interested in the results.

ujjawalcse commented 1 year ago

Thanks @eeskimez and @yzyouzhang for the clarification.