Open cxchhh opened 1 month ago
Hi, I have a few questions about the image tokens.
In the paper, the image tokens were extracted from video, but in the config of this code it seems the model actually only uses 1 frame to predict tracks.
So I wonder whether 1 frame is enough for this task, or whether there are alternatives with even better performance?
Hi, thank you very much for the question. It's true that we found 1 frame is enough for the Track Transformer on these tasks. We also tried including historical frames, but we sometimes ran into the causal confusion problem.
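For anyone reading along, here is a minimal sketch of the difference between a single-frame input and a stacked-history input. `make_obs` and the shapes are illustrative, not from this repo:

```python
import torch

def make_obs(frames: torch.Tensor, history: int = 1) -> torch.Tensor:
    """Take the last `history` frames of a (B, T, C, H, W) clip and stack
    them along the channel axis; history=1 reduces to a single frame."""
    b, t, c, h, w = frames.shape
    assert history <= t
    recent = frames[:, t - history:]              # (B, history, C, H, W)
    return recent.reshape(b, history * c, h, w)   # (B, history*C, H, W)

# history=1 gives the single-frame input discussed above; a larger history
# also feeds past frames, which is where causal confusion can creep in
# (the model may shortcut from its own past motion instead of the scene).
```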
Thanks for the answers. Also, I wonder why you chose to predict just 1 future action timestep instead of multiple actions (action chunking)?
Haha, this was the standard implementation of behavioral cloning before Diffusion Policy. You can try predicting a future action sequence, which might improve performance.
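If it helps, a minimal sketch of what such an action-chunked BC head could look like. `ChunkedPolicyHead`, `chunk_size`, and the MLP sizes are hypothetical, not this repo's API:

```python
import torch
import torch.nn as nn

class ChunkedPolicyHead(nn.Module):
    """Hypothetical BC head that regresses a chunk of the next
    `chunk_size` actions instead of a single next action."""

    def __init__(self, feat_dim: int, action_dim: int, chunk_size: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            # one flat vector covering every step of the chunk
            nn.Linear(256, chunk_size * action_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) -> actions: (B, chunk_size, action_dim)
        return self.mlp(feat).view(-1, self.chunk_size, self.action_dim)
```

Training would then supervise the whole chunk against the next `chunk_size` ground-truth actions (e.g. with an L2 loss); at test time you can execute the chunk open-loop or replan every step.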
Thanks. Btw, how did you prevent ATM from confusing similar task instructions in libero_goal, since the initial scenes in that suite are identical?
Our Track Transformer takes the language embedding as input, so it can figure out which task it is.
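I haven't checked the exact interface in this repo, but a minimal sketch of one common way to do this: project the language embedding (e.g. a frozen text-encoder feature for the instruction) and prepend it as an extra token next to the image tokens. All names here are illustrative:

```python
import torch
import torch.nn as nn

class LangConditionedEncoder(nn.Module):
    """Sketch: prepend a projected language embedding as an extra token,
    so attention can route task-specific information to the image tokens."""

    def __init__(self, dim: int = 256, lang_dim: int = 512, depth: int = 2):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, dim)  # text embedding -> token width
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, dim); lang_emb: (B, lang_dim)
        lang_token = self.lang_proj(lang_emb).unsqueeze(1)   # (B, 1, dim)
        tokens = torch.cat([lang_token, img_tokens], dim=1)  # (B, N+1, dim)
        return self.encoder(tokens)
```

With identical initial scenes, the language token is the only input that differs between tasks, so the predicted tracks can still diverge per instruction.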