Open cxchhh opened 1 month ago
Hi, I have a few questions about the image tokens.
In the paper, the image tokens were extracted from video, but in the config of this code it seems the model actually only uses 1 frame to predict tracks.
So I wonder whether 1 frame is enough for this task, or whether there are alternatives with even better performance?
Hi, thank you very much for the question. It's true that we found 1 frame is enough for the Track Transformer on these tasks. We also tried including historical frames, but we sometimes ran into the causal confusion problem.
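For anyone reading along, here is a minimal sketch of the difference between a single-frame input and a stacked-history input. `make_obs` and the shapes are illustrative, not from this repo:

```python
import torch

def make_obs(frames: torch.Tensor, history: int = 1) -> torch.Tensor:
    """Take the last `history` frames of a (B, T, C, H, W) clip and stack
    them along the channel axis; history=1 reduces to a single frame."""
    b, t, c, h, w = frames.shape
    assert history <= t
    recent = frames[:, t - history:]              # (B, history, C, H, W)
    return recent.reshape(b, history * c, h, w)   # (B, history*C, H, W)

# history=1 gives the single-frame input discussed above; a larger history
# also feeds past frames, which is where causal confusion can creep in
# (the model may shortcut from its own past motion instead of the scene).
```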
Thanks for the answers. Also, I wonder why you chose to predict just 1 future action timestep instead of multiple actions (action chunking)?
Haha, this was the standard implementation of behavioral cloning before Diffusion Policy. You can try predicting a future action sequence, which might improve performance.
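If it helps, a minimal sketch of what such an action-chunked BC head could look like. `ChunkedPolicyHead`, `chunk_size`, and the MLP sizes are hypothetical, not this repo's API:

```python
import torch
import torch.nn as nn

class ChunkedPolicyHead(nn.Module):
    """Hypothetical BC head that regresses a chunk of the next
    `chunk_size` actions instead of a single next action."""

    def __init__(self, feat_dim: int, action_dim: int, chunk_size: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            # one flat vector covering every step of the chunk
            nn.Linear(256, chunk_size * action_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) -> actions: (B, chunk_size, action_dim)
        return self.mlp(feat).view(-1, self.chunk_size, self.action_dim)
```

Training would then supervise the whole chunk against the next `chunk_size` ground-truth actions (e.g. with an L2 loss); at test time you can execute the chunk open-loop or replan every step.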
Thanks. Btw, how did you prevent ATM from confusing similar task instructions in libero_goal, since the initial scenes in that suite are identical?
Our Track Transformer takes the language embedding as input, so it can figure out which task it is.
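I haven't checked the exact interface in this repo, but a minimal sketch of one common way to do this: project the language embedding (e.g. a frozen text-encoder feature for the instruction) and prepend it as an extra token next to the image tokens. All names here are illustrative:

```python
import torch
import torch.nn as nn

class LangConditionedEncoder(nn.Module):
    """Sketch: prepend a projected language embedding as an extra token,
    so attention can route task-specific information to the image tokens."""

    def __init__(self, dim: int = 256, lang_dim: int = 512, depth: int = 2):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, dim)  # text embedding -> token width
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, dim); lang_emb: (B, lang_dim)
        lang_token = self.lang_proj(lang_emb).unsqueeze(1)   # (B, 1, dim)
        tokens = torch.cat([lang_token, img_tokens], dim=1)  # (B, N+1, dim)
        return self.encoder(tokens)
```

With identical initial scenes, the language token is the only input that differs between tasks, so the predicted tracks can still diverge per instruction.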