Doubiiu / CodeTalker

[CVPR 2023] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

Fail to reproduce on MEAD dataset #46

Open HarryXD2018 opened 1 year ago

HarryXD2018 commented 1 year ago

Hi, what a nice work!

I am currently attempting to reproduce this work on the MEAD dataset. Stage 1 of the process has gone smoothly; however, I am encountering an issue in Stage 2: after 20 epochs of training, the output shows no movement at all and remains static.
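For reference, here is a minimal sketch of the check I have in mind to confirm the output is truly static (assuming each prediction is dumped as a `(T, V*3)` NumPy array of flattened vertex coordinates; the file name is hypothetical):

```python
# Minimal sketch: verify that a stage 2 prediction is actually static.
# Assumes the prediction is saved as a NumPy array of shape (T, V*3),
# i.e. T frames of flattened vertex coordinates; the file name is hypothetical.
import numpy as np

pred = np.load("stage2_pred_sample.npy")      # (T, V*3)
disp = np.abs(np.diff(pred, axis=0))          # frame-to-frame displacement
print("max  per-frame displacement:", disp.max())
print("mean per-frame displacement:", disp.mean())
# Values near zero for every frame mean the model has collapsed to a static
# face, rather than producing motion too subtle to see in the rendering.
```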

Do you have any idea what might be going wrong?

Many thanks!

Doubiiu commented 1 year ago

Hi, did you use a 3D face reconstruction method to convert MEAD to 3D data? And do you mean you have visually checked the stage 1 (reconstruction) results and they looked good? Based on my early attempts on VOCASET and BIWI, stage 2 is harder to train, since it depends on the results of stage 1 and on the hyper-parameters/network architecture of the stage 2 model (e.g. number of transformer layers, number of heads, etc.); you may need to modify them if possible (it is not easy to make VQ-based models work as expected). Hope you can make it work as soon as possible~
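To make that concrete, here is an illustrative sketch of the kind of stage 2 hyper-parameter sweep this suggests (a sketch only; the key names are placeholders, not the actual CodeTalker config fields):

```python
# Illustrative only: knobs worth sweeping when the stage 2 decoder collapses
# to a static face. Key names are placeholders, not the real CodeTalker
# config fields.
from itertools import product

stage2_grid = {
    "num_layers": [4, 6, 8],     # transformer decoder depth
    "num_heads":  [4, 8],        # attention heads
    "hidden_dim": [512, 1024],   # feature width
    "lr":         [1e-4, 5e-5],  # learning rate
}

for combo in product(*stage2_grid.values()):
    cfg = dict(zip(stage2_grid.keys(), combo))
    print(cfg)   # each cfg would be written into a stage 2 config and trained
```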

HarryXD2018 commented 1 year ago

Thanks for the quick reply! :smile:

Yes, I reconstructed the MEAD dataset with EMOCA, and I use FLAME parameters instead of vertices to represent the face shapes, in order to reduce the data volume. Concretely, I visualized the stage 1 results and observed slight jitter. Is that okay? I stopped the training early because it was taking a long time.
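A rough way to quantify that jitter (a sketch only; the file names and the `(T, 50)` expression layout are assumptions, and the SciPy smoothing is just an experiment, not part of the original pipeline):

```python
# Sketch: quantify temporal jitter in the stage 1 reconstructions, assuming
# ground-truth and reconstructed FLAME expression parameters are stored as
# (T, 50) arrays. File names and shapes are assumptions.
import numpy as np
from scipy.signal import savgol_filter

gt    = np.load("gt_expression.npy")       # (T, 50)
recon = np.load("recon_expression.npy")    # (T, 50)

# Second-order temporal difference approximates acceleration; a jittery
# reconstruction shows a much larger value than the ground truth.
jitter = lambda x: np.abs(np.diff(x, n=2, axis=0)).mean()
print("gt jitter:      ", jitter(gt))
print("recon jitter:   ", jitter(recon))

# Optional experiment: light temporal smoothing of the stage 1 output, to see
# whether the jitter (and not the codebook itself) is what hurts stage 2.
recon_smooth = savgol_filter(recon, window_length=5, polyorder=2, axis=0)
print("smoothed jitter:", jitter(recon_smooth))
```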

If that's okay, I will mainly focus on architecture engineering to make it work.

Doubiiu commented 1 year ago

I see. I am not sure how well mapping audio to FLAME parameters will perform with CodeTalker (or a VQ-based method in general). Since stage 1 is the expected upper bound of the stage 2 results, I think you'd better make sure it works properly (smoothly and without artifacts) first.
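As a rough sanity gate before stage 2 (a sketch only; file names and shapes are assumptions), one could compare the stage 1 reconstructions against ground truth in FLAME parameter space and check that the error stays small and stable over time:

```python
# Sketch of a stage 1 sanity gate: small, stable reconstruction error in
# FLAME parameter space before any stage 2 training. File names and the
# (T, D) layout are assumptions.
import numpy as np

gt    = np.load("gt_params.npy")       # (T, D) ground-truth FLAME parameters
recon = np.load("recon_params.npy")    # (T, D) stage 1 reconstructions

per_frame_err = np.linalg.norm(gt - recon, axis=1)   # L2 error per frame
print("mean error:", per_frame_err.mean())
print("max  error:", per_frame_err.max())
# Occasional spikes in the per-frame error usually correspond to visible
# artifacts; both mean and max should be low before training stage 2 on top.
```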