EmptyJackson / policy-guided-diffusion

Official implementation of the RLC 2024 paper "Policy-Guided Diffusion"
MIT License

The value loss explodes when using pure synthetic data to train IQL #5

Closed · StepNeverStop closed this issue 3 months ago

StepNeverStop commented 5 months ago

hi,

One thing that confuses me: when I train IQL using this repo, the value function loss gradually explodes if I train the policy on pure synthetic data, but not on the real datasets.
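
For reference, this is what I mean by the value loss; a minimal sketch assuming the standard IQL expectile-regression objective (the variable names here are mine, not this repo's):

```python
import jax.numpy as jnp

def expectile_loss(diff, expectile=0.7):
    # Asymmetric L2: positive errors (Q > V) are weighted by `expectile`,
    # negative errors (Q < V) by (1 - expectile).
    weight = jnp.where(diff > 0, expectile, 1.0 - expectile)
    return weight * (diff ** 2)

def value_loss(q_target, v, expectile=0.7):
    # diff = Q(s, a) - V(s); this is the quantity whose loss blows up in the plots below.
    diff = q_target - v
    return expectile_loss(diff, expectile).mean()
```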

For example, here are the hopper training curves:

[image: hopper training curves with pure synthetic data]

Below is how it looks with the original real dataset on hopper-medium-v2; this seems normal:

[image: hopper-medium-v2 training curves with real data]

I keep all the default configurations when training agents, except for changing the experience mix ratio to 100% : 0% or 0% : 100%.
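
To be concrete about what I changed, this is roughly how I understand the experience mix; a hypothetical sketch (the buffer and argument names are mine, not the repo's):

```python
import jax
import jax.numpy as jnp

def sample_mixed_batch(key, real_data, synthetic_data, batch_size, synthetic_ratio):
    # synthetic_ratio = 1.0 is the pure-synthetic setting where the loss explodes;
    # synthetic_ratio = 0.0 is the pure-real setting that trains normally.
    n_synth = int(round(batch_size * synthetic_ratio))
    n_real = batch_size - n_synth
    key_synth, key_real = jax.random.split(key)
    synth_idx = jax.random.randint(key_synth, (n_synth,), 0, synthetic_data.shape[0])
    real_idx = jax.random.randint(key_real, (n_real,), 0, real_data.shape[0])
    return jnp.concatenate([synthetic_data[synth_idx], real_data[real_idx]], axis=0)
```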

I have no idea why the V and Q losses explode. I also noticed that the original paper uses a trajectory length of 16 when training the diffusion model, whereas I used this repo's default of 32.

If you have any insights or guidance, I would really appreciate your reply!

StepNeverStop commented 5 months ago

By the way, I observed this phenomenon across all datasets.

EmptyJackson commented 3 months ago

Hi, we've just released training logs that should help with this: https://api.wandb.ai/links/flair/jonpqc2o

Generally, this divergence occurs due to an undertrained diffusion model. You can see this in one of the runs above (hopper-random), where both the unguided and policy-guided optimization are unstable.

I'd recommend having another go with the hyperparameters given in the above logs - which match those given in the paper - as they should be stable. I'm happy to chat further if they don't work or anything else comes up!