buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Include goal information in the observation to make experiments meaningful #11

Closed namespacebilibili closed 1 month ago

namespacebilibili commented 1 month ago

Without guidance, the performance drops greatly. Does your observation contain the goal state? The distance-to-goal guidance seems too strong: its gradient alone may lead to good results.

namespacebilibili commented 1 month ago

I concatenated goals to the states, which reaches good performance even without guidance and MPC.

buoyancy99 commented 1 month ago

The observation doesn't contain the goal, and this is intentional.

First, the reason performance drops significantly without guidance is that, without guidance, it's neither offline RL nor planning; it's behavior cloning of trajectories in the dataset. The dataset only contains random walks, which of course rarely lead to the desired goal.

On the other hand, diffusion planning methods don't concatenate goals because they are designed to be more general than the goal-conditional setting. You only need to plug in any classifier or likelihood model to bias the sampling at sampling time. For example, you can add a classifier that steers the diffusion toward trajectories with two corners, so it generates paths that avoid three blocks across the map, without retraining the model. Goal guidance is only one such property. In high-dimensional domains like video, guidance is even more abstract, e.g., make sure frame 10 is a cat lying on the ground, without specifying any pixels. Diffusion Forcing is designed to handle such general cases via diffusion planning, not just a goal-conditional policy.
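As a minimal sketch of what sampling-time guidance looks like (the function names and the exact update rule here are illustrative, not the repo's API), any differentiable cost can be used to bias a denoising step:

```python
import torch

def guided_denoise_step(x, denoiser, cost_fn, guidance_scale=1.0):
    """One denoising step biased by the gradient of an arbitrary cost.

    `denoiser` is any learned model mapping noisy trajectories to cleaner ones;
    `cost_fn` is whatever objective you want at sampling time (distance to a
    goal, an obstacle penalty, a classifier log-likelihood, ...).
    """
    x = x.detach().requires_grad_(True)
    cost = cost_fn(x).sum()
    grad, = torch.autograd.grad(cost, x)
    with torch.no_grad():
        return denoiser(x) - guidance_scale * grad  # steer samples without retraining

# Goal guidance is just one choice of cost_fn; swapping it out needs no retraining.
goal = torch.tensor([6.0, 6.0])
goal_cost = lambda traj: ((traj[..., :2] - goal) ** 2).sum(dim=-1)
steered = guided_denoise_step(torch.randn(16, 2), lambda x: x, goal_cost, guidance_scale=0.1)
```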

Finally, I don't know how to get goals from the training data to concatenate, because the training data has no specific goal; it's just random walks. If you choose the last frame as the goal to concatenate, then your model will only reach that goal at the last step rather than as soon as possible via a flexible horizon, and it may exploit the bad reward design of the env, as I mentioned in the README.

namespacebilibili commented 1 month ago

I understand your point and agree that guidance is a key advantage of diffusion models. But don't you think the guidance is too strong for maze? Even if the model is untrained, the gradient of the distance will lead the generated states to the goal, like closed-loop control. And, in my opinion, the most important thing for a policy is to complete tasks, not just to learn a behavior distribution. The experiments with such guidance are not that convincing.
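A toy illustration of this concern (everything here is illustrative, nothing is taken from the repo): with a completely untrained "denoiser", repeatedly following the gradient of the distance-to-goal cost alone already drags the sampled states onto the goal, much like closed-loop control.

```python
import torch

goal = torch.tensor([6.0, 6.0])
states = torch.randn(32, 2) * 3.0             # stand-in for untrained model output

for _ in range(200):
    states = states.detach().requires_grad_(True)
    cost = ((states - goal) ** 2).sum()        # distance-to-goal guidance cost
    grad, = torch.autograd.grad(cost, states)
    states = states - 0.05 * grad              # guidance step with no learned model at all

print(torch.linalg.norm(states.detach() - goal, dim=-1).max())  # ~0: every state reaches the goal
```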

To obtain goals for training, just use the key "infos/goal" in the dataset. And by the way, the result is good with the goal in the observation! In the medium maze, it reaches a reward of ~140 without guidance and MPC. I think you could do more experiments to validate the effectiveness of the DF policy, e.g., on MuJoCo tasks.
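For reference, a minimal sketch of that goal-conditioned variant, assuming a D4RL-style maze2d dataset where "infos/goal" stores the 2D goal at each timestep (the dataset key comes from the comment above; the rest is an assumption):

```python
import gym
import d4rl  # noqa: F401  (registers the maze2d environments)
import numpy as np

env = gym.make("maze2d-medium-v1")
data = env.get_dataset()

obs = data["observations"]     # (N, obs_dim)
goals = data["infos/goal"]     # (N, 2) goal position at each timestep
obs_with_goal = np.concatenate([obs, goals], axis=-1)  # feed this to the sequence model
```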

buoyancy99 commented 1 month ago

Oh, I understand your criticism now: you are saying that because we have MPC, a strong guidance signal that just gets the general direction correct would still give you high reward without a good sequence model, is that correct?

I think the criticism is valid in that it points out the task is too easy with MPC. Therefore, to make it harder, feel free to try our MPC-free setting by adding algorithm.open_loop_horizon=800 (no re-training needed; just load the checkpoint). This disables MPC, and you will see that Diffusion Forcing is still really good, whereas a bad sequence model with the same guidance will perform poorly. We also experimented with a Diffuser-like setup, where you do guidance by replacement; it also works really well, except that it won't try to reach the goal as soon as possible. On the other hand, we noticed that a bad sequence model with such guidance also often fails in non-convex landscapes, e.g., when you need to move further from the goal first in order to eventually reach it, even with MPC. I left the default configuration as it is to reflect the full picture of what you can do, but the commands/checkpoints definitely work without MPC and in other settings too. Please let me know whether these ablations address your question.

A goal-conditioned policy is definitely a valid solution too! We are not trying to be the best goal-conditioned RL framework in our paper, so I am pretty sure there are a lot of interesting things you can do to make it a good RL policy!

namespacebilibili commented 1 month ago

Yes, you are right! But I don't really understand why open_loop_horizon disables MPC. In my view, use_diffused_action controls whether we use the diffused action or MPC, which is determined by whether we use the action as input. And if open_loop_horizon is 800, does it mean we generate the whole trajectory at once?

buoyancy99 commented 1 month ago

Yes, it means we generate the trajectory at once and do control like the original diffusion planning paper: there is still feedback, but no MPC. This is mainly to draw a distinction from a bad sequence model with the same guidance, which you deemed too strong, so it's just a minimal ablation that addresses the point that "guidance + a bad sequence model can also look reasonable because of MPC". The perfect picture is of course use_diffused_action, but I found that to be slightly unstable in the transformer version for now (the main insight is normalization, but I can talk more about that in separate threads), so the example uses what everybody else is doing.
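A conceptual sketch of that distinction (plan_trajectory and env are placeholders, not the repo's API, and the real open_loop_horizon flag may be wired differently): MPC would execute only a short prefix of each plan and replan from the newest observation, while open_loop_horizon=800 executes a single plan end to end.

```python
def rollout(env, plan_trajectory, episode_len=800, open_loop_horizon=800):
    """Roll out a planner; open_loop_horizon == episode_len means no replanning (no MPC)."""
    obs = env.reset()
    t = 0
    while t < episode_len:
        plan = plan_trajectory(obs, horizon=episode_len - t)  # guided diffusion plan
        n_exec = min(open_loop_horizon, len(plan))            # 800 -> execute the whole plan
        for action in plan[:n_exec]:
            obs, reward, done, info = env.step(action)
            t += 1
            if done or t >= episode_len:
                return
```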

namespacebilibili commented 1 month ago

That addresses my question! When I train an obs+goal version with use_diffused_action, the instability problem also exists. But that is not a big deal. Thanks for your time again!