No. The 6 is the temporal axis, which is generated by a sliding window. The default stride is 16 and the default temporal sample rate is 2, so each window covers 16 * 2 = 32 frames. Therefore, 214 // (16 * 2) = 6.
def extract_video(self, video_path: str, crop_face: bool = False, sample_rate: int = 2,
    stride: int = 16,
    reduction: str = "none",
    keep_seq: bool = False,
    detector_device: Optional[str] = None
) -> Tensor:
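To make the arithmetic in the reply concrete, here is a minimal sketch that reproduces only the integer division quoted above. The helper name `num_windows` is hypothetical and is not part of MARLIN; the library's actual sliding-window code may treat the trailing frames differently.

```python
def num_windows(total_frames: int, stride: int = 16, sample_rate: int = 2) -> int:
    """Hypothetical helper: temporal length given by the formula in the reply.

    Defaults mirror the `stride` and `sample_rate` defaults in the
    `extract_video` signature above.
    """
    frames_per_window = stride * sample_rate  # 16 * 2 = 32 source frames
    return total_frames // frames_per_window  # floor division drops the remainder


if __name__ == "__main__":
    # 214 frames -> 214 // 32 = 6, matching the first dimension of [6, 1024]
    print(num_windows(214))
```

Under this formula, the 22 frames left over after the sixth window (214 - 6 * 32) do not produce a seventh feature vector.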
Hello author, @ControlNet. I want to use Marlin to extract facial features from videos and use them in my downstream tasks. However, I have some doubts about the shape of the facial features that Marlin extracts.
For example, for a video with 214 frames, the facial feature extracted by Marlin has a shape of torch.Size([6, 1024]). I know the 1024 dimensions come from using marlin_vit_large_ytf, but I'm not sure where the 6 comes from. Is it because of random sampling?