Zhendong-Wang / Diffusion-Policies-for-Offline-RL

Apache License 2.0

Questions about overfitting & Q-value loss weight #15

Open return-sleep opened 6 months ago

return-sleep commented 6 months ago

When I train the model, its performance gradually declines in the middle and late stages of training as the number of training steps increases. Is this due to overfitting to the behavior policy, or to exploration error? Have you encountered this situation? Should I increase or decrease the Q-value weight?

Zhendong-Wang commented 6 months ago

During my training it is usually stable, especially for the MuJoCo tasks. Which environment are you testing? If this happens, you can monitor the Q-value function loss. Decreasing the Q-value weight usually stabilizes training, since the algorithm is then tuned to behave more like a behavior-cloning algorithm.
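For reference, here is a minimal sketch (not the repo's exact code) of how the Q-value weight trades off behavior cloning against policy improvement in a Diffusion-QL-style policy loss; `diffusion_policy`, `critic`, and their methods are placeholder names.

```python
import torch

def policy_loss(diffusion_policy, critic, states, actions, eta=1.0):
    # Behavior-cloning term: diffusion denoising loss on dataset actions.
    bc_loss = diffusion_policy.loss(actions, states)

    # Q-learning term: push actions sampled from the policy toward high critic values.
    new_actions = diffusion_policy.sample(states)
    q_values = critic(states, new_actions)
    # Normalizing by the detached Q magnitude keeps the weight comparable
    # across tasks with different reward scales.
    q_loss = -q_values.mean() / q_values.abs().mean().detach()

    # Smaller eta -> closer to pure behavior cloning, usually more stable;
    # larger eta -> stronger policy optimization via the critic.
    return bc_loss + eta * q_loss
```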

return-sleep commented 6 months ago

> During my training it is usually stable, especially for the MuJoCo tasks. Which environment are you testing? If this happens, you can monitor the Q-value function loss. Decreasing the Q-value weight usually stabilizes training, since the algorithm is then tuned to behave more like a behavior-cloning algorithm.

Thank you for your reply. I tested on vision-based MuJoCo tasks and run Diffusion_QL in a latent state space. I noticed your choice of the Q-value weight hyperparameter (from 0.01 to 3.5 across different tasks); is it generally true that the more complex the task, the lower the value we should use, i.e. the more we should rely on the behavior-cloning term?

Zhendong-Wang commented 5 months ago

From my experience, it depends. If the offline dataset is already very good, then a small $\eta$ can work well. If the offline dataset is only of medium quality, you need more policy optimization to reach good performance. And some complicated tasks, such as AntMaze, also require strong Q-learning to find the final destination.
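Building on the earlier advice to monitor the Q-value function loss, a crude (hypothetical, not from the repo) divergence check like the one below can signal when to lower $\eta$ and lean more on behavior cloning.

```python
import torch

def should_reduce_eta(critic_losses, window=10_000, ratio=2.0):
    """Return True when the recent critic loss has blown up relative to the
    earlier average -- a rough sign that Q-learning is destabilizing training."""
    losses = torch.as_tensor(critic_losses, dtype=torch.float32)
    if losses.numel() < 2 * window:
        return False
    early = losses[:window].mean()
    recent = losses[-window:].mean()
    return bool(recent > ratio * early)

# Example usage with a logged list of critic losses:
# eta = eta * 0.5 if should_reduce_eta(critic_loss_history) else eta
```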