felix-thu / DiffCPS

Diffusion Model Based Constrained Policy Search for Offline Reinforcement Learning
Apache License 2.0
6 stars 0 forks

My reproduction does not converge in the AntMaze environment #1

Open jensen-Shen-Li opened 10 months ago

jensen-Shen-Li commented 10 months ago

I cloned the repo and ran "python run --env_name antmaze-umaze-v0 --device 0 --lr_decay" without any modification. The results were good at first, but after 550000 epochs the score dropped to 0 and stayed there until the end. Are there parameters I need to modify? Or is the best score recorded somewhere so the result can be reproduced?

felix-thu commented 10 months ago

This might be due to random seeds behaving differently on different devices. Since diffusion models fundamentally rely on random sampling, the impact of the random seed on them is relatively significant. Additionally, given the sparse rewards in AntMaze, an overfitted model produces an output score of 0. The phenomenon you describe also occurs when training Diffusion QL [ICLR 2023] on antmaze-large. You can freely adjust target_kl and lambda_min to avoid this in the umaze task. My code records the best results.
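Since the seed sensitivity comes from the sampling inside the diffusion policy, one way to make runs comparable across devices is to pin every RNG source at startup. A minimal sketch (the `set_seed` helper and the commented-out NumPy/PyTorch lines are illustrative, not part of the DiffCPS codebase):

```python
import random

def set_seed(seed: int) -> None:
    """Pin the RNG sources so a run can be replayed exactly."""
    random.seed(seed)
    # In a PyTorch training script one would also pin the other RNGs, e.g.:
    # np.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)

# Two runs with the same seed draw identical samples.
set_seed(0)
first = [random.random() for _ in range(3)]
set_seed(0)
second = [random.random() for _ in range(3)]
assert first == second
```

Note that even with all seeds pinned, some CUDA kernels are nondeterministic by default, so exact bit-for-bit reproduction across different GPUs is not guaranteed.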


jensen-Shen-Li commented 10 months ago

Oh, thank you for your reply. I also made a comparison with Diffusion QL [ICLR 2023], and as you said, the overfitting phenomenon happened again. In both DiffCPS and Diffusion QL I used the best parameters recorded in the code, but because the seeds were not given, I cannot reproduce the best scores reported in the paper. How should seeds be selected, or how can I adjust them to achieve the best results in the paper?