buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Questions about robot imitation learning #15

Closed AaronChuh closed 1 month ago

AaronChuh commented 1 month ago

Hi, thanks for your excellent work! I have some questions about Section 4.4 of your paper.

  1. How do you represent a latent for robot imitation learning? Is it a tensor of shape (c, w, h), just like in the video prediction task?
  2. How do you predict the trajectory for the robot end-effector? Do you predict the executable trajectory directly, like Diffusion Policy, or predict a keypose for every latent state?
  3. How do you encode the observation? You use 2 cameras for the manipulation task, as most people do, so how do you fuse the 2 observations into a single latent?

Look forward to your kind reply!

buoyancy99 commented 1 month ago
  1. See this line of code. I pack the actions, which consist of a 4-dim rotation quaternion + 3-dim XYZ translation + 1-dim gripper state, into 2D and concatenate them as channels to the image.
  2. In each token I put 10 actions for faster speed, so it's like Diffusion Policy, although you can predict many multiples of 10 with full-sequence sampling.
  3. Just concatenate them channel-wise; see the code for more details.
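For readers trying to reproduce this, the packing described above might look like the following sketch. All shapes here (latent resolution, channel count) are illustrative assumptions, not the repo's actual configuration; the point is the layout: 2 camera latents fused channel-wise, then each of the 10×8 action scalars tiled into its own (H, W) plane and appended as extra channels.

```python
import numpy as np

# Hypothetical sizes for illustration only; the repo's actual
# latent resolution and channel counts differ.
C, H, W = 4, 32, 32          # per-camera latent: channels, height, width
ACT_DIM = 8                  # 4 quaternion + 3 XYZ translation + 1 gripper
ACTS_PER_TOKEN = 10          # actions packed into each token

def fuse_cameras(cam_a, cam_b):
    """Point 3: fuse two camera latents by channel-wise concatenation."""
    return np.concatenate([cam_a, cam_b], axis=0)         # (2C, H, W)

def pack_token(obs_latent, actions):
    """Points 1-2: tile each action scalar into an (H, W) plane and
    concatenate the planes onto the observation latent as channels."""
    flat = actions.reshape(-1)                            # 80 scalars
    planes = np.broadcast_to(flat[:, None, None],
                             (flat.size, H, W))           # (80, H, W)
    return np.concatenate([obs_latent, planes], axis=0)   # (2C+80, H, W)

obs = fuse_cameras(np.zeros((C, H, W)), np.zeros((C, H, W)))
token = pack_token(obs, np.zeros((ACTS_PER_TOKEN, ACT_DIM)))
print(token.shape)  # (88, 32, 32)
```

Tiling each scalar into a full spatial plane keeps the token a plain image-shaped tensor, so the same convolutional diffusion backbone used for video prediction can consume it without architectural changes.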