facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

few questions on feature detection #30

Open jez-moxmo opened 4 months ago

jez-moxmo commented 4 months ago

Hi, I just need some clarification to check whether I have this right; this is based on the video eval. The dataloader divides a single video clip from a single class into 16 frames and resizes them to 224, giving a tensor of shape [1, 3, 16, 224, 224]? The two lists wrapping this tensor are clips and views. Is views the number of temporal splices of a video within a class, and clips the number of different classes? If I am mistaken, how do these two differ? (A sketch of the nesting I mean follows below.)
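
To make the nesting concrete, here is a minimal sketch of what I believe the eval dataloader hands back; the counts and names are my own placeholders, not taken from the repo:

```python
import torch

# Assumed structure (my reading, not confirmed): an outer list over clips,
# an inner list over views, each entry a [B, C, T, H, W] video tensor.
num_clips = 2   # placeholder count
num_views = 3   # placeholder count

batch = [
    [torch.randn(1, 3, 16, 224, 224) for _ in range(num_views)]
    for _ in range(num_clips)
]

print(len(batch), len(batch[0]), tuple(batch[0][0].shape))
# -> 2 3 (1, 3, 16, 224, 224)
```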

The forward pass returns 1568 tokens, which is 14 x 14 x (clip length / 2). Does that imply each token encodes the delta/feature changes from frame_a to frame_b? If so, would it matter if we adjusted the positional embeddings to take the frame length from 2 to 200? There is no cross-attention implementation yet for video inference, so I am assuming the number of views per clip, or the number of classes passed through at any given time, doesn't matter? (My arithmetic for the 1568 is sketched below.)
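
For reference, here is where I get 1568 from, as a minimal sketch assuming a 16x16 spatial patch size and a tubelet size of 2 (so each token spans two consecutive frames); the Conv3d patch embed below is illustrative, not the repo's actual module:

```python
import torch
import torch.nn as nn

# Token-count arithmetic under the assumed patch/tubelet sizes.
frames, height, width = 16, 224, 224
tubelet, patch = 2, 16

num_tokens = (frames // tubelet) * (height // patch) * (width // patch)
print(num_tokens)  # 8 * 14 * 14 = 1568

# The same count falls out of a ViT-style 3D patch embedding:
embed = nn.Conv3d(3, 768, kernel_size=(tubelet, patch, patch),
                  stride=(tubelet, patch, patch))
x = torch.randn(1, 3, frames, height, width)
tokens = embed(x).flatten(2).transpose(1, 2)  # [1, N, embed_dim]
print(tokens.shape)                           # torch.Size([1, 1568, 768])
```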

Thank you kindly.