Closed austinmw closed 2 years ago
Hi @austinmw Apologies for the delay in responding. The model performs future prediction in the feature space. The features are then projected into a distribution over actions using a linear layer. In principle, that linear projection can be replaced with a decoder network to generate pixel values, however we did not experiment with that in this paper.
If I have a set of videos with no labeled actions, can I use this to just predict the next frame in a video?