facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

Is it possible to learn an action model (or an action's effects) with V-JEPA? #71

Open aymeric75 opened 1 week ago

aymeric75 commented 1 week ago

Hello,

I would like to know if it is possible to incorporate knowledge of the actions performed by an agent into the architecture.

From my understanding, the unmasked part of the image and the coordinates of the masked parts are given as input to the predictor, which then predicts the masked parts. So, as I understand it, the predictor predicts static elements (parts of the same image) rather than next states.

Would it be possible, instead, to make JEPA predict the next image, given the present image and an action? Or can the current implementation be used to produce representations that would fit this downstream task (i.e., obtaining the "effects" of an action on an image)?
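To make the idea concrete, here is a rough sketch of what I mean by conditioning the prediction on an action. This is purely illustrative, not the repo's actual predictor: the names, shapes, and the discrete-action assumption are all mine.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Toy predictor: current-frame tokens + a discrete action id -> next-frame tokens."""
    def __init__(self, embed_dim=768, num_actions=16, depth=6, num_heads=12):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, context_tokens, action):
        # context_tokens: [B, N, D] encoder representations of the current frame(s)
        # action:         [B]       id of the action taken by the agent (assumed discrete)
        a = self.action_embed(action).unsqueeze(1)    # [B, 1, D] action token
        x = torch.cat([context_tokens, a], dim=1)     # append it to the visual context
        x = self.blocks(x)
        return x[:, :-1]  # predicted representations of the next frame's tokens

# usage: predict next-state representations from current tokens and an action id
pred = ActionConditionedPredictor()
z_t = torch.randn(2, 196, 768)        # [B, N, D] current-frame tokens
a_t = torch.randint(0, 16, (2,))      # [B] action ids
z_next_hat = pred(z_t, a_t)           # [2, 196, 768]
```

The point is just that the "action" could enter the predictor as one extra token, so the prediction target becomes the next state rather than a masked part of the same clip.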

Thanks a lot

icekang commented 2 days ago

Hi,

If you want to predict the pixel values of the next frame based on the previous frames, you can modify the masking strategy to mask only the last frame (or something similar), as sketched below. However, as mentioned in the blog, video tends to progress slowly, which makes this type of task too easy.
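A minimal sketch of that kind of mask. The flattened T x H x W patch layout and the sizes are assumptions for illustration, not the repo's actual masking code:

```python
import torch

def last_frame_mask(T=8, H=14, W=14):
    """Split a flattened T*H*W patch grid into visible context indices and
    target indices covering only the final time step."""
    ids = torch.arange(T * H * W).reshape(T, H, W)
    target_ids = ids[-1].flatten()      # every patch of the last frame (to predict)
    context_ids = ids[:-1].flatten()    # every patch of the earlier frames (visible)
    return context_ids, target_ids
```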

It’s also important to note that, in most videos, things evolve somewhat slowly over time. If you mask a portion of the video but only for a specific instant in time and the model can see what came immediately before and/or immediately after, it also makes things too easy and the model almost certainly won’t learn anything interesting. As such, the team used an approach where it masked portions of the video in both space and time, which forces the model to learn and develop an understanding of the scene.
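For illustration, here is a toy version of such a spatio-temporal mask: a spatial block that is hidden at every time step, so neither earlier nor later frames reveal the masked region. Dimensions are arbitrary and this is not the repo's multiblock masking implementation.

```python
import torch

def spatiotemporal_tube_mask(T=8, H=14, W=14, block_h=6, block_w=6):
    """Mask the same spatial block across all T time steps of a T x H x W patch grid."""
    top = torch.randint(0, H - block_h + 1, (1,)).item()
    left = torch.randint(0, W - block_w + 1, (1,)).item()
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    mask[:, top:top + block_h, left:left + block_w] = True  # same block at every t
    return mask  # True = masked target, False = visible context
```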