TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

About the relationship between Whisper vs pretrained UNet SDv1.4 #159

Closed huyduong7101 closed 1 month ago

huyduong7101 commented 2 months ago

In this work, the authors adopt Whisper-tiny (d_model=384) to extract audio features, while training the UNet from scratch. I guess the reason for training from scratch instead of loading pretrained SDv1.4 is that the pretrained model has cross_attention_dim=768, while the feature dim of Whisper-tiny is 384. Hence, I wonder why you don't use Whisper-small (d_model=768), which has the same dimension as pretrained SDv1.4, so that the strong pretrained SDv1.4 model can be utilized.
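For reference, the dimension mismatch described above can be checked directly from the public model configs; a minimal sketch assuming the Hugging Face `transformers` and `diffusers` packages (the checkpoint IDs below are the public Whisper and SD v1.4 releases, not necessarily what MuseTalk itself loads):

```python
from transformers import WhisperConfig
from diffusers import UNet2DConditionModel

# Whisper-tiny encoder hidden size (d_model)
whisper_cfg = WhisperConfig.from_pretrained("openai/whisper-tiny")
print(whisper_cfg.d_model)  # 384

# SD v1.4's UNet expects 768-dim cross-attention conditioning
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet.config.cross_attention_dim)  # 768
```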

czk32611 commented 2 months ago
  1. The reason we used Whisper-tiny is to have a smaller time delay during real-time inference.
  2. We did not use pretrained SDv1.4 because SDv1.4 is an image-to-noise model, not an image-to-image model. However, someone has tried using pretrained SDv1.4 as initialization, and it actually converged faster.
  3. The dimension of the audio feature is not important, as one can always use a projection network to map features to a different shape (see the sketch after this list).
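On point 3, a single learned projection is enough to bridge the 384-dim Whisper-tiny features to a 768-dim cross-attention space; a minimal PyTorch sketch (the layer name and shapes are illustrative, not MuseTalk's actual code):

```python
import torch
import torch.nn as nn

# Map 384-dim Whisper-tiny features to the 768-dim
# cross-attention space expected by the SD v1.4 UNet.
audio_proj = nn.Linear(384, 768)

audio_feats = torch.randn(1, 50, 384)  # (batch, time, d_model)
cond = audio_proj(audio_feats)         # (batch, time, 768)
print(cond.shape)  # torch.Size([1, 50, 768])
```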

Hope the above information helps.

huyduong7101 commented 2 months ago

Thank you for your quick response; it is very helpful. Can I ask one more question, related to issue https://github.com/TMElyralab/MuseTalk/issues/158: how did you crop the face before feeding it into the model, using only face detection or also the "bbox shift"?

czk32611 commented 2 months ago

> How did you crop the face before feeding it into the model, using only face detection or also the "bbox shift"?

Currently we only use a face detector, and we did not perform bbox shift during training.
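For context, "bbox shift" in MuseTalk's README refers to moving the detected face box vertically before cropping; a hedged sketch of what such a crop could look like (the `crop_face` helper, the bbox format, and the shift direction are assumptions for illustration, not the repo's actual implementation):

```python
import numpy as np

def crop_face(frame, bbox, bbox_shift=0):
    """Crop a face region from a frame.

    bbox is (x1, y1, x2, y2) from any face detector; bbox_shift
    moves the box vertically (here, positive = down). Per the
    author above, only detection is used and no shift is applied
    during training.
    """
    x1, y1, x2, y2 = bbox
    y1, y2 = y1 + bbox_shift, y2 + bbox_shift
    h, w = frame.shape[:2]
    # Clamp the shifted box to the frame boundaries.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    return frame[y1:y2, x1:x2]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
face = crop_face(frame, (500, 200, 780, 480), bbox_shift=0)
print(face.shape)  # (280, 280, 3)
```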