Closed — zwfcrazy closed this issue 4 years ago
You can remove the recognition and adversarial parts of the model; then it will work regardless of language and input length. Although a crucial component is removed, I think reasonable results can still be obtained this way with acceptable performance. It would be better if the pretrained weights of our model could be loaded and then finetuned on your dataset. However, you may need to modify the code (delete several parts, change the input length) for it to work well.
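The suggestion above amounts to loading only the pretrained weights whose layers survive after the recognition and adversarial branches are deleted. A minimal sketch of that filtering step, assuming hypothetical module names (`recognition`, `discriminator`, etc. — not the repository's actual names):

```python
# Hypothetical sketch: filter a pretrained checkpoint so it can be loaded
# into a trimmed model that no longer has the recognition/adversarial heads.
# Module prefixes below are assumptions for illustration only.

REMOVED_PREFIXES = ("recognition.", "discriminator.")

def filter_pretrained(state_dict):
    """Drop entries belonging to the removed branches; the remaining
    weights could then be loaded with something like
    model.load_state_dict(filtered, strict=False) in PyTorch."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(REMOVED_PREFIXES)}

# Toy dict standing in for a real checkpoint:
ckpt = {
    "audio_encoder.conv1.weight": "...",
    "recognition.fc.weight": "...",
    "discriminator.conv.weight": "...",
    "generator.deconv1.weight": "...",
}
kept = filter_pretrained(ckpt)
print(sorted(kept))  # only audio_encoder/generator entries remain
```

Using `strict=False` (in PyTorch) is the usual way to tolerate the missing keys once the branches are deleted.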
@zwfcrazy have you tried this: https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose? It seems to work regardless of language.
@Hangz-nju-cuhk this paper, https://arxiv.org/pdf/2004.12992.pdf, cites this work and is able to handle head pose and speaker awareness.
@ak9250 Thanks for the references. I am familiar with both of these papers and had even seen their videos before they were on arXiv. They are both great works. I would definitely recommend researchers try the state-of-the-art models, as mine seems a little out of date by now.
@ak9250 @Hangz-nju-cuhk sorry for the late reply. Thank you both! I will close this issue for now.
I want to build a dataset of Chinese characters to train this model. I applied speech recognition to some Chinese news videos (by CCTV). The recognition part was fine, but I found that Chinese characters are too short in terms of pronunciation time, because each of them has only one syllable. The average number of video frames it takes to show the lip movement of a single Chinese character is only 5 (at 25 fps), and it can even be as low as 2 frames. This is much less than the required 29 frames. Obviously, interpolation won't work well in this case. So I would like to know: have you considered Chinese? Will this model work? Is there any workaround?
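For reference, the naive interpolation workaround mentioned above would look roughly like the following: linearly resampling a short clip (e.g. 5 frames for one syllable) up to the 29 frames the model expects. This is only an illustration of the idea, not code from the repository; frames are represented as scalars for brevity, where real frames would be image arrays.

```python
# Sketch of the interpolation workaround: stretch a short lip-motion clip
# to a fixed target length by linear interpolation between frames.
# With only 2-5 source frames, this mostly duplicates/blends frames,
# which is why it tends to produce unnatural lip motion.

def resample(frames, target_len):
    """Linearly interpolate a list of frames to target_len frames."""
    n = len(frames)
    if n == 1:
        return frames * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)   # position in the source clip
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        t = pos - lo
        out.append(frames[lo] * (1 - t) + frames[hi] * t)
    return out

clip = [0.0, 1.0, 2.0, 3.0, 4.0]   # 5 frames at 25 fps (~0.2 s syllable)
up = resample(clip, 29)
print(len(up))  # 29
```

The endpoints are preserved (`up[0] == clip[0]`, `up[-1] == clip[-1]`), but with a 5-to-29 stretch most output frames are blends of at most two nearly identical source frames, which matches the observation that interpolation won't recover realistic per-syllable lip dynamics.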