ControlNet / MARLIN

[CVPR] MARLIN: Masked Autoencoder for facial video Representation LearnINg
https://openaccess.thecvf.com/content/CVPR2023/html/Cai_MARLIN_Masked_Autoencoder_for_Facial_Video_Representation_LearnINg_CVPR_2023_paper

Questions About the Output Shape of Features Extracted From a Pre-trained Model #20

Closed. IQIUM closed this issue 3 months ago.

IQIUM commented 7 months ago

Hello author, @ControlNet. I want to use MARLIN to extract facial features from videos and use them for my downstream tasks. However, I have some doubts about the shape of the facial features extracted by MARLIN.

For example, for a video with 214 frames, the facial feature extracted by MARLIN has a shape of torch.Size([6, 1024]). I know the 1024-dimensional features come from using marlin_vit_large_ytf, but I'm not sure where the 6 comes from. Is it because of random sampling?

ControlNet commented 7 months ago

No. The 6 is the temporal axis, produced by a sliding window over the video. Each clip contains 16 frames sampled at a temporal sample rate of 2, so a window spans 16 * 2 = 32 raw frames; with the default stride of 16 the windows do not overlap, giving 214 // (16 * 2) = 6 clips.

def extract_video(
    self,
    video_path: str,
    crop_face: bool = False,
    sample_rate: int = 2,
    stride: int = 16,
    reduction: str = "none",
    keep_seq: bool = False,
    detector_device: Optional[str] = None,
) -> Tensor:
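
For completeness, here is a minimal usage sketch assuming the marlin_pytorch package API (Marlin.from_online and extract_video); the video path is a placeholder, and the shape comment refers to the 214-frame example above:

from marlin_pytorch import Marlin

# Load the pretrained encoder; marlin_vit_large_ytf produces 1024-dim features.
model = Marlin.from_online("marlin_vit_large_ytf")

# Default sliding-window settings: 16-frame clips sampled every 2nd frame,
# i.e. 32 raw frames per clip, with stride 16 (non-overlapping windows).
features = model.extract_video(
    "path/to/video.mp4",  # placeholder path
    crop_face=True,
    sample_rate=2,
    stride=16,
    reduction="none",
)

# For a 214-frame video: 214 // (16 * 2) = 6 clips -> torch.Size([6, 1024])
print(features.shape)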