facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Is it possible to learn an action model (or an action's effects) with V-JEPA? #71

Open aymeric75 opened 1 week ago

aymeric75 commented 1 week ago

Hello,

I would like to know if it is possible to incorporate knowledge of the actions performed by an agent into the architecture.

From my understanding, the unmasked part of the image and the coordinates of the masked parts are given as input to the predictor, which then predicts the masked parts. So, as I understand it, the predictor predicts static elements (parts of the same image) rather than next states.

Would it be possible, instead, to make JEPA predict the next image, given the present image and an action? Or can the current implementation be used to produce representations that would fit this downstream task (i.e., obtaining the "effects" of an action on an image)?
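To make the idea concrete, here is a rough sketch of what I mean by conditioning the prediction on an action. This is purely illustrative, not the repo's actual predictor: the names, shapes, and the discrete-action assumption are all mine.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Toy predictor: current-frame tokens + a discrete action id -> next-frame tokens."""
    def __init__(self, embed_dim=768, num_actions=16, depth=6, num_heads=12):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, context_tokens, action):
        # context_tokens: [B, N, D] encoder representations of the current frame(s)
        # action:         [B]       id of the action taken by the agent (assumed discrete)
        a = self.action_embed(action).unsqueeze(1)    # [B, 1, D] action token
        x = torch.cat([context_tokens, a], dim=1)     # append it to the visual context
        x = self.blocks(x)
        return x[:, :-1]  # predicted representations of the next frame's tokens

# usage: predict next-state representations from current tokens and an action id
pred = ActionConditionedPredictor()
z_t = torch.randn(2, 196, 768)        # [B, N, D] current-frame tokens
a_t = torch.randint(0, 16, (2,))      # [B] action ids
z_next_hat = pred(z_t, a_t)           # [2, 196, 768]
```

The point is just that the "action" could enter the predictor as one extra token, so the prediction target becomes the next state rather than a masked part of the same clip.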

Thanks a lot

icekang commented 2 days ago

Hi,

If you want to predict the pixel values of the next frame based on the previous frames, you can modify the masking strategy to mask only the last frame (or something similar), as sketched below. However, as mentioned in the blog, video tends to progress slowly, which makes this type of task too easy.
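A minimal sketch of that kind of mask. The flattened T x H x W patch layout and the sizes are assumptions for illustration, not the repo's actual masking code:

```python
import torch

def last_frame_mask(T=8, H=14, W=14):
    """Split a flattened T*H*W patch grid into visible context indices and
    target indices covering only the final time step."""
    ids = torch.arange(T * H * W).reshape(T, H, W)
    target_ids = ids[-1].flatten()      # every patch of the last frame (to predict)
    context_ids = ids[:-1].flatten()    # every patch of the earlier frames (visible)
    return context_ids, target_ids
```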

It’s also important to note that, in most videos, things evolve somewhat slowly over time. If you mask a portion of the video but only for a specific instant in time and the model can see what came immediately before and/or immediately after, it also makes things too easy and the model almost certainly won’t learn anything interesting. As such, the team used an approach where it masked portions of the video in both space and time, which forces the model to learn and develop an understanding of the scene.
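For illustration, here is a toy version of such a spatio-temporal mask: a spatial block that is hidden at every time step, so neither earlier nor later frames reveal the masked region. Dimensions are arbitrary and this is not the repo's multiblock masking implementation.

```python
import torch

def spatiotemporal_tube_mask(T=8, H=14, W=14, block_h=6, block_w=6):
    """Mask the same spatial block across all T time steps of a T x H x W patch grid."""
    top = torch.randint(0, H - block_h + 1, (1,)).item()
    left = torch.randint(0, W - block_w + 1, (1,)).item()
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    mask[:, top:top + block_h, left:left + block_w] = True  # same block at every t
    return mask  # True = masked target, False = visible context
```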