Div99 / IQ-Learn

(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation
https://div99.github.io/IQ-Learn/

Issue on reproducing MuJoCo results #2

Closed Ending2015a closed 2 years ago

Ending2015a commented 2 years ago

Hello! Could you provide the hyperparameters and the number of training steps for each MuJoCo env needed to reproduce the Table 5 results (Appendix D.2 of the original paper)? I tried the iq_learn/scripts/run_mujoco.sh script to train on Ant-v2 for ~300k steps with the default hyperparameters and 10 expert trajectories, but only got eval returns of around 3000 to 4000: eval/episode_reward shows 3301.59521 and best_returns is 4275.31665. Thank you!

Div99 commented 2 years ago

Hi, thanks for the interest in this work. We reliably reach ~4700 as reported in the paper on Ant-v2 with 10 experts over three different seeds. See the chart below:

Interactive W&B link: Wandb Chart

[Figure: Ant-v2 with 10 experts]

Div99 commented 2 years ago

These are the parameters used: agent=sac agent.actor_lr=3e-05 agent.critic_lr=0.0003 agent.init_temperature=0.001 agent.learnable_temperature=False env=ant env.learn_steps=1e6 env.demos=10 method.loss=value method.regularize=True num_actor_updates=1 num_seed_steps=0 train.batch=256 train.soft_update=True train.use_target=True seed=2
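For anyone who wants to sweep seeds with this configuration, here is a minimal sketch that turns the overrides above into a `key=value` command line. The override values are copied from the comment; the entry point name `train_iq.py` and the seed values `0, 1, 2` are assumptions for illustration only, so substitute whatever iq_learn/scripts/run_mujoco.sh actually invokes and the seeds you want.

```python
import shlex

# Overrides copied verbatim from the comment above (kept as strings so that
# values such as "3e-05" and "1e6" are passed through unchanged).
OVERRIDES = {
    "agent": "sac",
    "agent.actor_lr": "3e-05",
    "agent.critic_lr": "0.0003",
    "agent.init_temperature": "0.001",
    "agent.learnable_temperature": "False",
    "env": "ant",
    "env.learn_steps": "1e6",
    "env.demos": "10",
    "method.loss": "value",
    "method.regularize": "True",
    "num_actor_updates": "1",
    "num_seed_steps": "0",
    "train.batch": "256",
    "train.soft_update": "True",
    "train.use_target": "True",
    "seed": "2",
}


def with_seed(overrides, seed):
    """Return a copy of the overrides with a different seed."""
    return {**overrides, "seed": str(seed)}


def build_command(overrides, entry_point="train_iq.py"):
    """Render the overrides as a shell command.

    `entry_point` is a hypothetical default; it is not taken from the thread.
    """
    args = [f"{key}={value}" for key, value in overrides.items()]
    return shlex.join(["python", entry_point, *args])


if __name__ == "__main__":
    # The seed values here are placeholders; the thread only gives seed=2.
    for seed in (0, 1, 2):
        print(build_command(with_seed(OVERRIDES, seed)))
```

Printing the commands rather than executing them makes it easy to check the overrides before launching a 1e6-step run.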

Div99 commented 2 years ago

Let me know if there are any issues with reproducing the other MuJoCo envs. Happy to help!

Ending2015a commented 2 years ago

I reproduced the results with these parameters. Thank you!

robfiras commented 2 years ago

Did you use these hyperparameters for the other MuJoCo environments as well? If not, it would be great if you could share those too. Thanks!