Inference Speed Question

mbreuss commented 3 months ago

Hi @StarCycle

Thanks for your contributions toward making GR-1 fully open source! I was curious about the inference speed of GR-1 compared to our MDT policy (if you tried that too). Can you share some of your experience? :)

Thanks!

StarCycle commented 3 months ago

Hi @mbreuss,

Sorry, I haven't run MDT or measured the inference speed of GR-1 yet... but I did think about it:

Inference of GR-1 can be much quicker if we choose not to predict future images and only predict future actions. This is possible because both the image query tokens and the action query tokens are masked in the attention mechanism, so no other token attends to them. Leaving the image query tokens out of the GR-1 input therefore does not influence the decoding of actions. I can help you measure the time if you need it.
However, as we discussed before, my GR-chunk (if we don't use the original GR-1 setting) predicts multiple future actions and only executes the first one, while MDT executes all the actions it predicts. Thus, I think MDT will be less affected by the inference speed problem.
The drawback of executing all predicted actions is that the policy cannot react to emergencies. Imagine the policy is executing the predicted action sequence and a person suddenly comes in and disturbs the environment...
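The masking argument in the first point can be checked on a toy example. The sketch below is not GR-1 code: it is a single attention layer with identity projections and no causal mask, and it assumes that query tokens are hidden as keys from every other token while each query may still attend to itself. All names (`attention`, `build_mask`, the token kinds) are made up for illustration.

```python
import math
import random

def attention(tokens, visible):
    """Single-head self-attention with identity Q/K/V projections.

    visible[i][j] is True if token i is allowed to attend to token j.
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        # Scaled dot-product scores for the visible keys only.
        scores = {
            j: sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for j, key in enumerate(tokens)
            if visible[i][j]
        }
        top = max(scores.values())
        weights = {j: math.exp(s - top) for j, s in scores.items()}
        norm = sum(weights.values())
        outputs.append([
            sum(w * tokens[j][d] for j, w in weights.items()) / norm
            for d in range(dim)
        ])
    return outputs

def build_mask(kinds):
    """Context tokens are visible to everyone; a query token only to itself."""
    n = len(kinds)
    return [[kinds[j] == "ctx" or i == j for j in range(n)] for i in range(n)]

rng = random.Random(0)

def random_token(dim=4):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

context = [random_token() for _ in range(3)]        # language / observation tokens
image_queries = [random_token() for _ in range(2)]  # image query tokens
action_query = [random_token()]                     # action query token

# Full sequence: context + image queries + action query.
full_out = attention(
    context + image_queries + action_query,
    build_mask(["ctx"] * 3 + ["img_q"] * 2 + ["act_q"]),
)
# Shortened sequence: the image queries are simply left out.
short_out = attention(
    context + action_query,
    build_mask(["ctx"] * 3 + ["act_q"]),
)

# The action query is the last token in both sequences.
max_diff = max(abs(a - b) for a, b in zip(full_out[-1], short_out[-1]))
print("max difference in action output:", max_diff)
```

Under this masking assumption the action output is the same with or without the image queries, while the shortened sequence has fewer tokens to process. How much wall-clock time this saves in the real model would still need to be measured.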

By the way, I have some questions about accelerating diffusion models:

There are some works on accelerating diffusion policies, including ManiCM and Consistency Policy (based on consistency models) and AdaFlow (based on rectified flow). I haven't tried them yet. Hmm, which approach is better in your experience?
I am developing a robotics video generation model based on this genie implementation. Genie is a MaskGIT model that seems to be quicker than diffusion models. But I am really not familiar with recent research on accelerating video generation DiTs. Could you recommend some?

About using CLA loss in my network:

I just released a policy that is based on Microsoft's Florence2 VLM: mimictest. I hope that by using a pretrained VLM, I don't need to do language-vision alignment myself... One version is similar to MDT:

But there are some confusing phenomena:

I analysed the failure cases of Chi Cheng's Diffusion UNet and Diffusion Transformer. His policy seldom "hesitates" or "pauses" before picking and placing an object, even if it enters a state that was not seen in the training dataset. I guess that's because a diffusion policy can generate different actions in the same state, so it will "explore" the environment until it sees a familiar state.
But when I use Florence2 + diffusion transformer, it tends to "hesitate" and "pause" again... Why?
You can see the failure cases and analysis here.

StarCycle

mbreuss commented 3 months ago

Thanks for the detailed answer!

The GR-1 part is not that important, I was just curious :D
I believe that a middle way between replanning at every single step and a full trajectory rollout would be ideal. Some kind of gating that lets the model actively replan when certain conditions are met, so it can react fast in emergencies, would be my best bet for the future.
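One way to picture such a gate is the toy rollout below. It is only a sketch under made-up assumptions: a 1D state, a hand-written `predict_chunk` standing in for the policy, and a gate that fires when the observed state deviates from the state the plan expected by more than a threshold. None of this comes from GR-1 or MDT.

```python
def predict_chunk(state, goal, horizon, step=1.0):
    """Stand-in policy: plan `horizon` clipped steps toward the goal.

    Returns a list of (action, expected_next_state) pairs.
    """
    plan = []
    for _ in range(horizon):
        action = max(-step, min(step, goal - state))
        state = state + action
        plan.append((action, state))
    return plan

def rollout(goal, horizon, threshold, disturbances, steps):
    """Execute action chunks, replanning when the chunk ends or the gate fires."""
    state, policy_calls, peak = 0.0, 0, 0.0
    plan = []
    for t in range(steps):
        if not plan:
            plan = predict_chunk(state, goal, horizon)
            policy_calls += 1
        action, expected = plan.pop(0)
        state += action + disturbances.get(t, 0.0)
        peak = max(peak, state)
        # Gate: the world no longer matches the plan, so drop the stale actions.
        if abs(state - expected) > threshold:
            plan = []
    return state, policy_calls, peak

push = {1: 3.0}  # an external push at step 1
never = float("inf")

every_step = rollout(goal=5.0, horizon=1, threshold=never, disturbances=push, steps=12)
full_chunk = rollout(goal=5.0, horizon=8, threshold=never, disturbances=push, steps=12)
gated = rollout(goal=5.0, horizon=8, threshold=0.5, disturbances=push, steps=12)

for name, result in [("every step", every_step), ("full chunk", full_chunk), ("gated", gated)]:
    final_state, policy_calls, peak = result
    print(f"{name:>10}: final={final_state} policy_calls={policy_calls} peak={peak}")
```

In this toy setting the full-chunk rollout keeps executing stale actions after the push and overshoots the goal, the per-step rollout avoids that at the cost of one policy call per step, and the gated rollout avoids the overshoot with only a few calls. What a good gating condition looks like for a real robot is an open design question that the sketch does not answer.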

Accelerating Diffusion Models

I think overall these ideas are good and necessary to explore, but for developing novel ideas with diffusion as an action representation, I just stick to my EDM Diff + 5 denoising steps. This variant is usually pretty strong already and enough for most applications, so I never explored the others myself. The flow community is moving fast, so I think we will see very strong and easy-to-train models very soon that surpass rectified ones.
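For readers unfamiliar with what "EDM + 5 denoising steps" looks like in code, here is a minimal sketch. It uses the noise schedule from the EDM paper (Karras et al., 2022) with that paper's image defaults for `sigma_min`, `sigma_max` and `rho`, and a plain Euler sampler. These choices are assumptions: the thread does not say which sampler or noise range MDT uses, and `toy_denoiser` is a placeholder for a trained network.

```python
import random

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """n noise levels following the EDM schedule, plus a final 0."""
    low, high = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    sigmas = [(high + i / (n - 1) * (low - high)) ** rho for i in range(n)]
    return sigmas + [0.0]

def euler_sample(denoise, x, sigmas):
    """Deterministic Euler steps from the highest noise level down to 0."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoise(x, sigma)  # one network evaluation per step
        slope = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        x = [xi + (sigma_next - sigma) * si for xi, si in zip(x, slope)]
    return x

# Placeholder denoiser: pretends every noisy sample denoises to one fixed action.
target_action = [0.3, -0.7]

def toy_denoiser(x, sigma):
    return list(target_action)

rng = random.Random(0)
sigmas = karras_sigmas(5)  # 5 denoising steps
x_init = [rng.gauss(0.0, 1.0) * sigmas[0] for _ in target_action]
action = euler_sample(toy_denoiser, x_init, sigmas)
print("noise levels:", [round(s, 4) for s in sigmas])
print("sampled action:", action)
```

With five noise levels the sampler calls the denoiser five times per predicted action chunk, which is the cost that consistency or flow-based methods aim to reduce further.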

Regarding your questions for MDT and VLM Diffusion

I think your overall idea and thinking are good. You don't need CLA for a pretrained VLM backbone, given its deep understanding of both modalities from pretraining.

Choice of Diffusion Head

Why are you using the default Diff Policy Transformer and not the MDT Decoder link? I can highly recommend using the FiLM-conditioned decoder from MDT to condition on the noise level effectively and to separate the noise token from the other state and goal tokens! Otherwise, especially in settings with many obs and goal tokens, the model is not great. I compared the MDT architecture with the default Diff-T on CALVIN, and the default Diff-T only achieves an average rollout length of around 1.5.
Which part of the VLM are you training for Florence? Can you encode multiple camera views with it already?
Did you try out different ways to use the tokens for the action head? Did you also test using all output tokens?
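As background for the FiLM suggestion: FiLM conditioning means the conditioning signal (here the noise level) produces a per-feature scale and shift that is applied to the hidden tokens, instead of being appended as one more token in the sequence. The sketch below shows only this generic idea. It is not the MDT decoder, and every name and weight in it is hypothetical.

```python
import math

def sinusoidal_embedding(sigma, dim):
    """Embed a scalar noise level into `dim` features (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(sigma * f) for f in freqs] + [math.cos(sigma * f) for f in freqs]

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def film(tokens, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift every token feature-wise, based on the conditioning vector."""
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * h + b for g, h, b in zip(gamma, token, beta)] for token in tokens]

dim = 4
tokens = [[0.5, -1.0, 2.0, 0.1], [1.5, 0.3, -0.4, 0.9]]  # state/goal/action tokens

# Arbitrary illustrative weights; a real model would learn these.
w_gamma = [[0.1 * (i + 1)] * dim for i in range(dim)]
w_beta = [[0.05 * (i + 1)] * dim for i in range(dim)]
ones, zeros = [1.0] * dim, [0.0] * dim

out_low_noise = film(tokens, sinusoidal_embedding(0.1, dim), w_gamma, ones, w_beta, zeros)
out_high_noise = film(tokens, sinusoidal_embedding(10.0, dim), w_gamma, ones, w_beta, zeros)

# With zero weights, gamma = 1 and beta = 0, so the layer is the identity.
zero_w = [[0.0] * dim for _ in range(dim)]
identity_out = film(tokens, sinusoidal_embedding(0.1, dim), zero_w, ones, zero_w, zeros)
print(out_low_noise[0])
print(out_high_noise[0])
```

The noise level never enters the token sequence here, so it does not compete for attention with the observation and goal tokens, which seems to be the separation the comment above is recommending.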

Weird Rollout Behavior

That's indeed strange, but since it's a simple simulation and both variants do it, I would not put too much emphasis on it. It could be some weird control thing from the sim, or maybe a few demos with this behavior. Since the demos are human-collected, pauses could be part of them too.
"it will "explore" the environment until it sees a familiar state." I am not sure about this statement. BC, even with diffusion, is just trying to predict some plausible action given the current state; it has no ability to detect when it is outside the known state distribution.

Here is a low-level example of that: the model is only trained on demos in a small area of [-6, 6]. The DP generalizes the overall trend. I averaged 100 action predictions, and the low variance in x-regions below -6 and above 6 shows that the model is pretty certain about what it is doing, although it has never seen these states.
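The measurement described above (sample the policy many times at a fixed state, then look at the mean and variance) can be sketched as follows. The `sample_action` function is a hypothetical stand-in, a linear trend plus fixed sampling noise, so the numbers it prints say nothing about a real diffusion policy. Only the procedure is the point.

```python
import random
import statistics

rng = random.Random(0)

def sample_action(x):
    """Hypothetical stochastic policy: linear trend plus fixed sampling noise."""
    return 0.5 * x + rng.gauss(0.0, 0.05)

def prediction_stats(x, n=100):
    """Mean and variance of n sampled action predictions at state x."""
    samples = [sample_action(x) for _ in range(n)]
    return statistics.fmean(samples), statistics.pvariance(samples)

# 0.0 lies inside the training range [-6, 6]; -10.0 and 10.0 lie outside it.
stats = {x: prediction_stats(x) for x in (-10.0, 0.0, 10.0)}
for x, (mean, variance) in stats.items():
    print(f"x={x:+.1f}  mean={mean:+.3f}  variance={variance:.5f}")
```

With a real policy, `sample_action` would run the full denoising loop from a fresh noise sample each time. A variance that stays low outside the training range would then match the observation above that sampling diversity alone does not signal out-of-distribution states.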

Video Diffusion

I have seen a few papers that try to accelerate VideoDiff, but nothing that enables fast and easy video gen so far. In theory you could distill them using Consistency Models, but that takes additional distillation training, which is not cheap. More work in this direction is being published, such as this one: https://oahzxl.github.io/PAB/ so I think you can see rapid advances soon, given the OpenSORA initiative and similar projects https://github.com/hpcaitech/Open-Sora

GENIE

That looks super interesting! Keep me updated on this, I believe that GENIE offers many interesting applications for robotics!

If you want, you can drop me an email at moritz.reuss@kit.edu when you have more questions regarding Diffusion Policies and similar ideas! I would be interested to discuss more.

Best, Moritz

StarCycle commented 3 months ago

Great, thanks! I sent you an email!

EDiRobotics / GR1-Training

Inference Speed Question #3