Hope the above information helps.
Thank you for your quick response; it is very helpful. Can I ask one more question, related to another issue, https://github.com/TMElyralab/MuseTalk/issues/158? How did you crop the face and feed it into the model: using only face detection, or also the "bbox shift"?
Currently we only use a face detector; no bbox shift is performed during training. A sketch of the idea is below.
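For illustration, here is a minimal sketch of detector-only cropping with an optional vertical shift. The bbox format and the `crop_face`/`bbox_shift` names are assumptions for the example, not the repo's actual API; per the answer above, training corresponds to `bbox_shift=0`.

```python
import numpy as np

def crop_face(frame: np.ndarray, bbox: tuple[int, int, int, int],
              bbox_shift: int = 0) -> np.ndarray:
    """Crop the face region given a detector bbox (x1, y1, x2, y2).

    bbox_shift moves the box vertically; 0 reproduces plain
    detector-only cropping, i.e. what is used during training.
    """
    x1, y1, x2, y2 = bbox
    y1 = max(0, y1 + bbox_shift)
    y2 = min(frame.shape[0], y2 + bbox_shift)
    return frame[y1:y2, x1:x2]
```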
In this work, the authors adopt Whisper-tiny (d_model=384) to extract audio features while training the UNet from scratch. I guess the reason for training from scratch instead of loading the pretrained SD v1.4 weights is that the pretrained model has cross_attention_dim=768, while the feature dimension of Whisper-tiny is 384. So I wonder: why not use Whisper-small (d_model=768), which matches the cross-attention dimension of pretrained SD v1.4, so that the strong SD v1.4 pretrained weights could be utilized?
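For reference, a minimal sketch (assuming the `diffusers` and `transformers` libraries are installed) that makes the dimension mismatch in the question concrete:

```python
from diffusers import UNet2DConditionModel
from transformers import WhisperModel

# Pretrained SD v1.4 UNet: its cross-attention expects 768-dim conditioning.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet.config.cross_attention_dim)  # 768

# Whisper-tiny features are 384-dim -> incompatible with the pretrained UNet.
tiny = WhisperModel.from_pretrained("openai/whisper-tiny")
print(tiny.config.d_model)  # 384

# Whisper-small features are 768-dim -> would match SD v1.4's cross-attention.
small = WhisperModel.from_pretrained("openai/whisper-small")
print(small.config.d_model)  # 768
```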