Hi, thanks for your excellent work! I have some questions about Section 4.4 of your paper.
How do you represent the latent for robot imitation learning? Is it a tensor of shape (c, w, h), just like in the video prediction task?
How do you predict the trajectory for the robot end-effector? Do you predict the executable trajectory directly, as in Diffusion Policy, or do you predict a keypose for every latent state?
How do you encode the observation? Also, since you use 2 cameras for the manipulation task, as most people do, how do you fuse the 2 observations into a single latent?
See this line of code. I pack the actions, which consist of a 4-dim rotation quaternion + a 3-dim XYZ translation + a 1-dim gripper state, into a 2D plane and concatenate it as extra channels to the image.
I put 10 actions in each token for faster inference, so it's similar to Diffusion Policy, although you can predict many multiples of 10 with full-sequence sampling.
Just concatenate them channel-wise; see the code for more details.
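To make the two answers above concrete, here is a minimal sketch of both steps: fusing the two camera views channel-wise, and packing a chunk of 10 actions (4-dim quaternion + 3-dim XYZ + 1-dim gripper = 8 dims each) into a 2D plane appended as an extra channel. All shapes and function names here are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def fuse_cameras(cam_a, cam_b):
    """Fuse two camera observations by channel-wise concatenation.

    cam_a, cam_b: (C, H, W) arrays -> returns (2C, H, W).
    """
    return np.concatenate([cam_a, cam_b], axis=0)

def pack_actions(obs, actions):
    """Pack an action chunk into a 2D plane and append it as a channel.

    obs:     (C, H, W) fused observation latent.
    actions: (10, 8) chunk -- 10 actions of quat(4) + xyz(3) + gripper(1).
    Returns  (C + 1, H, W).
    """
    c, h, w = obs.shape
    flat = actions.reshape(-1)                      # (80,) flattened chunk
    plane = np.zeros(h * w, dtype=obs.dtype)
    plane[: flat.size] = flat                       # embed actions at the start
    plane = plane.reshape(1, h, w)                  # lift to a 2D channel
    return np.concatenate([obs, plane], axis=0)

# Illustrative usage with assumed 3x64x64 camera images.
cam_a = np.random.rand(3, 64, 64).astype(np.float32)
cam_b = np.random.rand(3, 64, 64).astype(np.float32)
chunk = np.random.rand(10, 8).astype(np.float32)

fused = fuse_cameras(cam_a, cam_b)   # (6, 64, 64)
packed = pack_actions(fused, chunk)  # (7, 64, 64)
```

The point is only the data layout: actions live in one extra channel alongside the image channels, so a single image-shaped latent carries both observation and action information.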
Looking forward to your kind reply!