Hi, thanks for your excellent work! I have some questions about Section 4.4 of your paper.
How do you represent the latent for robot imitation learning? Is it a tensor of shape (c, w, h), just like in the video prediction task?
How do you predict the trajectory for the robot end-effector? Do you predict the executable trajectory directly, as in Diffusion Policy, or do you predict a keypose for every latent state?
How do you encode the observation? Also, since you use 2 cameras for the manipulation task, as most people do, how do you fuse the 2 observations into a single latent?
See this line of code. I pack the actions, which consist of a 4-dim rotation quaternion + a 3-dim XYZ translation + a 1-dim gripper state, into a 2D plane and concatenate it as extra channels to the image.
I put 10 actions in each token for faster inference, so it's similar to Diffusion Policy, although you can predict many multiples of 10 with full-sequence sampling.
Just concatenate them channel-wise; see the code for more details.
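To make the two answers above concrete, here is a minimal sketch of both steps: fusing the two camera views channel-wise, and packing a chunk of 10 actions (4-dim quaternion + 3-dim XYZ + 1-dim gripper = 8 dims each) into a 2D plane appended as an extra channel. All shapes and function names here are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def fuse_cameras(cam_a, cam_b):
    """Fuse two camera observations by channel-wise concatenation.

    cam_a, cam_b: (C, H, W) arrays -> returns (2C, H, W).
    """
    return np.concatenate([cam_a, cam_b], axis=0)

def pack_actions(obs, actions):
    """Pack an action chunk into a 2D plane and append it as a channel.

    obs:     (C, H, W) fused observation latent.
    actions: (10, 8) chunk -- 10 actions of quat(4) + xyz(3) + gripper(1).
    Returns  (C + 1, H, W).
    """
    c, h, w = obs.shape
    flat = actions.reshape(-1)                      # (80,) flattened chunk
    plane = np.zeros(h * w, dtype=obs.dtype)
    plane[: flat.size] = flat                       # embed actions at the start
    plane = plane.reshape(1, h, w)                  # lift to a 2D channel
    return np.concatenate([obs, plane], axis=0)

# Illustrative usage with assumed 3x64x64 camera images.
cam_a = np.random.rand(3, 64, 64).astype(np.float32)
cam_b = np.random.rand(3, 64, 64).astype(np.float32)
chunk = np.random.rand(10, 8).astype(np.float32)

fused = fuse_cameras(cam_a, cam_b)   # (6, 64, 64)
packed = pack_actions(fused, chunk)  # (7, 64, 64)
```

The point is only the data layout: actions live in one extra channel alongside the image channels, so a single image-shaped latent carries both observation and action information.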
Looking forward to your kind reply!