buoyancy99 / diffusion-forcing

code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

Experiments on Maze2D Planning #9

Closed: Perkins729 closed this issue 1 month ago

Perkins729 commented 1 month ago

Hello, thank you for open sourcing your code. I am trying to replicate the training and validation process on the Maze2D planning task without making any changes to the code or settings (with guidance_scale set to 0.5). The inference results look like this; there seem to be issues with colliding with walls, reaching the goal but not stopping, or failing to reach the goal at all. Where might the problem lie?

image
buoyancy99 commented 1 month ago

Hello, the reason you are seeing this is that:

  1. Stopping is never a behavior contained in the dataset, so the agent will always go back and forth around the goal unless I use Diffuser's conditioning by replacement, which forces the trajectory to end at the goal (see the sketch after this list).
  2. We directly execute the diffused actions instead of coding a PD controller like previous methods, in order to offer a more general framework, and this happens to be quite hard in this environment. If we use prior work's simpler setting, Diffusion Forcing can produce really nice, perpendicular trajectories too.
  3. There were some problems with the initial version of the code. Have you pulled the latest code and tried what I did in #2 (guidance_scale in df_planning.yaml)?
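
For concreteness, here is a minimal sketch of what "conditioning by replacement" means in this context. The function, `denoise_step`, and the tensor layout are hypothetical placeholders rather than this repo's actual code:

```python
def denoise_with_goal_replacement(model, x, goal, goal_idx, num_steps):
    # Hypothetical sketch, not this repo's API: model.denoise_step(x, t) is an
    # assumed one-step reverse-diffusion call on a trajectory tensor of shape
    # (batch, horizon, state_dim). After each step we overwrite the state at
    # goal_idx with the known goal, so the sampled plan is forced to end there.
    for t in reversed(range(num_steps)):
        x = model.denoise_step(x, t)
        x[:, goal_idx, : goal.shape[-1]] = goal
    return x
```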
buoyancy99 commented 1 month ago

Also, I will release the transformer version of the code very soon, which is faster and better in many ways! This problem will also be mitigated.

Perkins729 commented 1 month ago


I got it. I also have another question: when I increase the guidance scale, it becomes easier to reach the target position, but the trajectory becomes less smooth. What are your insights on this? Thanks for your time.

buoyancy99 commented 1 month ago

Hi, I just released the transformer version on the main branch. Please take a look at the updated README. This version has none of these issues.

About the insight: why can diffusion models do classifier guidance? Because a diffusion model is trained to model p(x) by sampling with grad(ln p(x)). What if we sample with grad(ln p(x)) + grad(ln c(x)) instead? That amounts to sampling from p(x) * c(x), i.e. the sample x should both satisfy the data distribution (the likelihood of the dynamics) and the classifier (the reward). However, this is an oversimplification because p and c are not independent, and the two objectives compete with each other: if you over-emphasize c, you lose some realism of p, and vice versa.
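
A minimal sketch of that sampling rule, assuming a `score_model` that returns grad(ln p(x)) and a differentiable `log_reward` standing in for ln c(x); both names are placeholders, not this repo's API:

```python
import torch

def guided_score(score_model, log_reward, x, t, guidance_scale=0.5):
    # Illustrative sketch: score_model(x, t) ~ grad(ln p(x)); log_reward(x) ~ ln c(x).
    # Sampling with their sum draws x approximately from p(x) * c(x)^guidance_scale,
    # so a larger guidance_scale pulls harder toward the goal at the cost of
    # dynamics realism, which is why trajectories get less smooth.
    x = x.detach().requires_grad_(True)
    grad_log_c = torch.autograd.grad(log_reward(x).sum(), x)[0]
    return score_model(x, t) + guidance_scale * grad_log_c
```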

buoyancy99 commented 1 month ago

BTW this is a visualization of the transformer implementation

Screenshot 2024-07-30 at 6 38 36 PM

Perkins729 commented 1 month ago


I have actually trained with your latest released transformer-based code, and the results are indeed very good; thank you for open sourcing it. However, I have two questions:

  1. You mentioned in the new README: 'This version of maze planning uses a different version of diffusion forcing from the original paper - while doing the follow-up to diffusion forcing, we realized that training with independent noise actually constructed a smooth interpolation between causal and non-causal models too, since we can just mask out future by complete noise (fully causal) or some noise (interpolation). The best thing is, you can still account for causal uncertainty via pyramid sampling in this setting, by masking out tokens at different noise levels, and you can still have a flexible horizon because you can tell the model that padded entries are pure noise, a unique ability of diffusion forcing.' I think the original version of the code (the paper version?) also trained with independent noise, right? What has been updated in the new version? (A rough sketch of this masking idea is included after these questions.)
  2. On the Maze2D-medium dataset, the learning space of the RNN-based code is a six-dimensional [observation, action] vector, but the learning space of the current transformer-based code is only the first two dimensions of the observation, because the command specifies dataset.action_mean=[] dataset.action_std=[] dataset.observation_mean=[3.5092521,3.4765592] dataset.observation_std=[1.3371079,1.52102]. Why not learn the action and observation jointly?
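
For reference, a rough sketch of the noise-level masking idea described in that README passage; all names and constants below are illustrative, not the repo's actual code:

```python
import torch

def per_token_noise_levels(horizon, valid_len, mode="causal", num_levels=1000):
    # Hypothetical sketch: each token gets its own noise level k, where
    #   k = 0              -> fully observed token
    #   k = num_levels - 1 -> pure noise (token is effectively masked out).
    # Masking all future tokens with pure noise gives a fully causal model;
    # intermediate levels interpolate toward a non-causal one; "pyramid"
    # increases noise with distance into the future to reflect causal
    # uncertainty; padded entries past valid_len are always pure noise,
    # which is what allows a flexible horizon.
    if mode == "causal":
        k = torch.full((horizon,), num_levels - 1, dtype=torch.long)
        k[0] = 0
    elif mode == "pyramid":
        k = torch.linspace(0, num_levels - 1, horizon).long()
    else:  # "non_causal": denoise all tokens at the same level
        k = torch.zeros(horizon, dtype=torch.long)
    k[valid_len:] = num_levels - 1
    return k
```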

namespacebilibili commented 1 month ago

For the second question, I believe that when using MPC you do not need the diffused actions. But if you want to use the actions, that is fine too, and the results are promising.

buoyancy99 commented 1 month ago


Somehow I missed this follow-up question. There are some interesting insights we gained recently: when diffusing a sequence x_{1:T}, if you jointly diffuse its derivative and the derivative is normalized, training becomes very unstable, especially for transformers. Why is this? The diffusion model's math defines a likelihood according to a Gaussian model, which reduces to an L2 loss. However, a Gaussian likelihood model is particularly bad here! Notice that the values of derivatives are very small; if we normalize them, we put a heavy emphasis on the derivative. Since the x sequence is already heavily emphasized due to normalization, together they lead to a highly (over-)constrained optimization landscape. This is even more obvious when you have second-order derivatives. In our latest experiments for Diffusion Forcing 2, training / generation is very stable on environments whose observations/actions don't have a clear derivative relationship, and very unstable on those that do (you have to manually tune how heavily things are normalized). Somehow we didn't observe this as much in Diffusion Forcing v1, likely due to the RNN.
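
A toy illustration of this normalization effect (the numbers are made up, not taken from the repo):

```python
# Positions span the maze (std ~1.5) while per-step velocities are tiny
# (std ~0.05). After per-dimension normalization, the Gaussian / L2 diffusion
# loss weighs a small absolute velocity error as heavily as a much larger
# position error, even though velocity is just the derivative of position,
# so the objective becomes over-constrained.
pos_std, vel_std = 1.5, 0.05
pos_err, vel_err = 0.10, 0.01      # comparable relative prediction quality

raw_loss = pos_err ** 2 + vel_err ** 2                            # ~0.0101, position-dominated
norm_loss = (pos_err / pos_std) ** 2 + (vel_err / vel_std) ** 2   # ~0.0444, derivative-dominated
print(raw_loss, norm_loss)
```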

Perkins729 commented 1 month ago


Amazing insights!