facebookresearch / VisualVoice

Audio-Visual Speech Separation with Cross-Modal Consistency

Variants of Pretrained Model #17

Closed: IFICL closed this issue 1 year ago

IFICL commented 2 years ago

Nice work and impressive results. In the ablation study of your paper, I saw variants of your model that take only the static face (identity) features or only the lip-motion features as the visual signal.

I am personally interested in those ablation models, and I wonder whether I could ask for the pre-trained weights of the static-face version. I assume the weights you currently provide are for the full model: if I change the configuration to use only identity features as input, the current U-Net weights do not support that input. When I instead tried zeroing out the lip-motion feature and running a demo video, i.e. `visual_feature = torch.cat((identity_feature, lipreading_feature * 0), dim=1)`, it did not work either.
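
For reference, here is a minimal sketch of the zero-out experiment described above. The variable names follow the snippet in this comment; the channel sizes (128 for identity, 512 for lip motion) and the 4-D shape are illustrative assumptions, not the repo's actual values:

```python
# Sketch of the zero-out experiment; channel sizes are assumed, not
# the actual VisualVoice dimensions.
import torch

identity_feature = torch.randn(1, 128, 1, 1)    # static-face (identity) embedding
lipreading_feature = torch.randn(1, 512, 1, 1)  # lip-motion embedding

# Zeroing the lip-motion half keeps the concatenated shape the U-Net
# expects, but the network was trained with real lip features, so an
# all-zero half is out of distribution for the full-model weights.
visual_feature = torch.cat((identity_feature, lipreading_feature * 0), dim=1)
print(visual_feature.shape)  # torch.Size([1, 640, 1, 1])
```

This keeps the input tensor shape-compatible, which is presumably why it runs at all, but it does not reproduce the static-face ablation, since that variant was trained with its own weights.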