Thanks for such wonderful work! I'm curious about the video model. What datasets are used to train it? My understanding is that, in the DMC setting, the video model is trained on expert trajectories from all DMC tasks, like walker-walk and cheetah-run. Is that right? If so, how can the model generate videos of different tasks with the same embodiment (like walker-stand and walker-walk)?
The video model is trained on consecutive frames, so trajectories sampled for the same embodiment will span different tasks. One can condition on a task ID or on text to sample a video of a particular task.
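For concreteness, here is a minimal sketch of what task-ID conditioning could look like. Everything here is hypothetical (the class names, shapes, and the linear backbone standing in for the actual video generator) and is not taken from the paper's code; it only illustrates how a per-task embedding can steer sampling toward walker-stand vs. walker-walk.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a video sampler that accepts a task-ID embedding as a
# conditioning signal. All names here are illustrative, not the paper's code.
class TaskConditionedSampler(nn.Module):
    def __init__(self, num_tasks: int, cond_dim: int = 128):
        super().__init__()
        # One learned embedding per task (e.g., walker-stand vs. walker-walk).
        self.task_embedding = nn.Embedding(num_tasks, cond_dim)
        # Stand-in for the real denoiser / frame predictor.
        self.backbone = nn.Linear(cond_dim, 3 * 64 * 64)

    def forward(self, task_id: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
        # Look up the conditioning vector for the requested task.
        cond = self.task_embedding(task_id)          # (B, cond_dim)
        # Generate each frame conditioned on the same task vector.
        frames = [self.backbone(cond) for _ in range(num_frames)]
        video = torch.stack(frames, dim=1)           # (B, T, 3*64*64)
        return video.view(task_id.shape[0], num_frames, 3, 64, 64)

# Usage: sample a walker-walk video (task_id=1) vs. walker-stand (task_id=0).
model = TaskConditionedSampler(num_tasks=2)
video = model(torch.tensor([1]))                     # (1, 16, 3, 64, 64)
print(video.shape)
```

Text conditioning would follow the same pattern, with the embedding lookup replaced by a text encoder producing the conditioning vector.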
Thanks again!