Open alexcbb opened 6 months ago
Feature details
The model's dynamics are handled by a decoder-only MaskGIT transformer. Given the tokenized video (from the VQ-VAE) and a latent action (from the latent action model), it predicts the next frame.
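To make the intended behavior concrete, here is a minimal sketch of MaskGIT-style iterative decoding for one frame, using only the standard library. It is not this repository's implementation: `MASK`, `cosine_schedule`, `dummy_predict`, and `decode_next_frame` are hypothetical names, and the predictor is a random stub standing in for the transformer. The cosine masking schedule follows the one commonly used with MaskGIT; the real model would condition on the past frame tokens and the latent action.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a not-yet-decoded token


def cosine_schedule(progress: float) -> float:
    """Fraction of tokens still masked at decoding progress in [0, 1]."""
    return math.cos(0.5 * math.pi * progress)


def dummy_predict(context_tokens, action, frame_tokens, vocab_size, rng):
    """Stand-in for the transformer: one (token, confidence) pair per position.

    A real model would condition on context_tokens and action; this stub
    ignores them and samples at random so the loop can run on its own.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in frame_tokens]


def decode_next_frame(context_tokens, action, num_tokens, vocab_size,
                      steps=8, seed=0, predict=dummy_predict):
    """Start from a fully masked frame and unmask it over `steps` passes."""
    rng = random.Random(seed)
    frame = [MASK] * num_tokens
    for step in range(1, steps + 1):
        preds = predict(context_tokens, action, frame, vocab_size, rng)
        # How many tokens should remain masked after this pass.
        keep_masked = max(0, math.floor(cosine_schedule(step / steps) * num_tokens))
        masked = [i for i, tok in enumerate(frame) if tok == MASK]
        # Commit the most confident predictions among still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = max(0, len(masked) - keep_masked)
        for i in masked[:num_to_fill]:
            frame[i] = preds[i][0]
    return frame
```

For example, `decode_next_frame([[1, 2], [3, 4]], action=0, num_tokens=16, vocab_size=32)` returns a list of 16 token ids with no masked positions left. Swapping `dummy_predict` for a call into the actual dynamics model is the part this issue would need to fill in.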
What needs to be done