ControlNet / MARLIN

[CVPR] MARLIN: Masked Autoencoder for facial video Representation LearnINg
https://openaccess.thecvf.com/content/CVPR2023/html/Cai_MARLIN_Masked_Autoencoder_for_Facial_Video_Representation_LearnINg_CVPR_2023_paper

How to use Marlin for downstream tasks? #19

Closed: rainbowoldhorse closed this issue 7 months ago

rainbowoldhorse commented 7 months ago

Hello author, @ControlNet. I want to use your MARLIN model with Wav2Lip. After reviewing the decoder code you provided earlier, I have a few questions I would like to ask you.

1. I understand that your "x" is the output of the MARLIN encoder. Taking the first merged tensor as an example, its shape is (B, 8, 768, 14, 14), and "img" consists of 16 images of the lower half of the face stacked along the channel dimension, so its shape is (B, 3*2, 8, 14, 14). I understand it this way because 768 needs to be concatenated with 6 along the channel dimension, and I cannot think of a better explanation. Is my understanding correct? If not, what are the intended meaning and shape of "x" and "img"?

2. In your decoder code, I cannot tell where the audio features are fused with the MARLIN features. How should I integrate the audio?

I would sincerely appreciate your advice on these two questions. I hope you can clear up my confusion. Thanks.

ControlNet commented 7 months ago
  1. Shapes

The SyncNet is modified and trained on 16 frames at resolution (48x96). Assume the architecture is ViT-B. From the frame encoder, the face embedding is (B, 1568, 768). It is then reshaped to (B*16, 384, 14, 14), following the original Wav2Lip (see the shape sketch after this reply). Likewise, the audio embedding is (B*16, 384, 1, 1), which is then expanded to (B*16, 384, 14, 14).

  2. Audio fusion

With the vanilla Wav2Lip audio encoder, the audio is mapped to an audio feature. The audio feature and the frame features are then concatenated as the input to the decoder, with shape (B*16, 384+384, 14, 14).
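
A minimal sketch of the shapes described above, assuming PyTorch; the exact way the 768-dim tokens split into two 384-dim per-frame halves, and all tensor names, are illustrative assumptions rather than the released code:

```python
import torch

B = 2  # illustrative batch size

# MARLIN ViT-B encoder output: 8 temporal x 14 x 14 spatial tokens, 768-dim each.
face_tokens = torch.randn(B, 1568, 768)

# Assumed reshape: split each 768-dim token into two 384-dim halves, one per frame
# of the temporal tubelet (tubelet size 2), yielding 16 per-frame (384, 14, 14) maps.
x = face_tokens.view(B, 8, 14, 14, 2, 384)                    # (B, T=8, H, W, 2, C=384)
x = x.permute(0, 1, 4, 5, 2, 3).reshape(B * 16, 384, 14, 14)  # (B*16, 384, 14, 14)

# Stand-in for the vanilla Wav2Lip audio embedding, expanded over the spatial grid
# and concatenated channel-wise to form the decoder input.
audio = torch.randn(B * 16, 384, 1, 1).expand(-1, -1, 14, 14)
decoder_in = torch.cat([x, audio], dim=1)

print(x.shape)           # torch.Size([32, 384, 14, 14])
print(decoder_in.shape)  # torch.Size([32, 768, 14, 14])
```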

rainbowoldhorse commented 7 months ago

Thank you very much for your answer! I have three more questions to confirm with you.

  1. Is MARLIN's input the 16 full-face reference frames, or the 16 images with the masked lower half that are to be restored, i.e. (B, 3, 16, 224, 224)?

  2. Is "img"(In your decoder for wav2lip) 16 half face images and 16 full face images stacked at the channel (B*16, 3+3, 224, 224)?

  3. When using the LRS2 dataset, do you directly resize the images to (224, 224)?

ControlNet commented 7 months ago

For 1 and 2, what we mainly did was replace the encoder and decoder architectures to fit the ViT architecture, and change the temporal size to 16. Please check the Wav2Lip implementation (https://github.com/Rudrabha/Wav2Lip/blob/master/models/wav2lip.py) for your questions. We didn't change anything else in Wav2Lip, which is why we call this method "Wav2Lip + MARLIN": to show it is just a slight modification rather than a novel method.
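
For reference, a hedged sketch of how the decoder image input might be assembled following the original Wav2Lip convention at 224x224; the stacking order and tensor names are assumptions based on this thread, not taken from the released code:

```python
import torch

B = 2  # illustrative batch size

# Frames with the masked lower half (to be in-painted) and full-face reference frames.
masked_faces = torch.randn(B * 16, 3, 224, 224)
ref_faces = torch.randn(B * 16, 3, 224, 224)

# Wav2Lip stacks the two along the channel dimension -> (B*16, 6, 224, 224).
img = torch.cat([masked_faces, ref_faces], dim=1)
print(img.shape)  # torch.Size([32, 6, 224, 224])
```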

For 3, yes. By the way, after the output is generated by the decoder, we resize it back to fit the input shape of the SyncNet.
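
A small sketch of that post-processing step, assuming bilinear resizing (the interpolation mode is not stated in the thread):

```python
import torch
import torch.nn.functional as F

# Decoder output at 224x224, resized back to the (48, 96) resolution that the
# modified SyncNet expects (shapes taken from the reply above).
pred = torch.randn(2 * 16, 3, 224, 224)
pred_for_syncnet = F.interpolate(pred, size=(48, 96), mode="bilinear", align_corners=False)
print(pred_for_syncnet.shape)  # torch.Size([32, 3, 48, 96])
```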

rainbowoldhorse commented 7 months ago

Sincerely thank you for your reply.

ggaabe commented 7 months ago

This is awesome stuff. @rainbowoldhorse, I would love to see what you're building with this; I'm about to try to do the same thing. What hardware are you using for training? I'm probably going to try running training on an M2 Mac, but I'm not sure yet whether it will be enough.

rainbowoldhorse commented 7 months ago

I'm sorry, I'm not sure whether a Mac can run machine learning models; I haven't used a Mac before. But I can run it with a 1060 (6GB) graphics card on Windows/Linux.