HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License
1.26k stars 239 forks

GAIL and AIRL don't work #680

Closed · mertalbaba closed this issue 1 year ago

mertalbaba commented 1 year ago

Bug description

Your adversarial imitation implementations, including GAIL and AIRL, do not work well in MuJoCo environments. I tested them on Hopper, HalfCheetah and Humanoid, and both AIRL and GAIL failed to reach a meaningful score after 100k and 1 million training steps.

Steps to reproduce

import numpy as np
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.sac import MlpPolicy  # use the SAC policy class, not the PPO one

from imitation.algorithms.adversarial.gail import GAIL
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)

env = gym.make("HalfCheetah-v3")
expert = SAC(policy=MlpPolicy, env=env)  # SAC has no n_steps argument; that is an on-policy (PPO) parameter
expert.learn(100000)
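
# Optional sanity check: confirm the expert is actually competent before
# collecting demonstrations, since a weak expert caps achievable imitation
# performance.
expert_reward, _ = evaluate_policy(expert, expert.get_env(), 10)
print("Expert mean reward:", expert_reward)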

rollouts = rollout.rollout(
    expert,
    make_vec_env(
        "HalfCheetah-v3",
        n_envs=5,
        post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],
        rng=rng,
    ),
    rollout.make_sample_until(min_timesteps=100000, min_episodes=60),
    rng=rng,
)

venv = make_vec_env("HalfCheetah-v3", n_envs=8, rng=rng)
learner = SAC(env=venv, policy=MlpPolicy)
reward_net = BasicRewardNet(
    venv.observation_space,
    venv.action_space,
    normalize_input_layer=RunningNorm,
)
gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)

gail_trainer.train(100000)
rewards, _ = evaluate_policy(learner, venv, 100, return_episode_rewards=True)
print("Rewards:", rewards)

Environment

AdamGleave commented 1 year ago

Performance of RL and imitation learning algorithms is very sensitive to hyperparameters and implementation details. Please try the training scripts and tuned hyperparameters we provide in https://github.com/HumanCompatibleAI/imitation/tree/master/benchmarking. If those results don't line up with the ones reported in the paper https://arxiv.org/pdf/2211.11972.pdf, then please do let us know.
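
For context, the tuned configurations in that directory target PPO as the generator algorithm. A minimal Python-API sketch of a PPO-based GAIL run, reusing the rollouts collected in the script above, would look roughly like this (the hyperparameter values below are illustrative placeholders, not the tuned benchmark values; the exact tuned values and command-line entry points are documented in the benchmarking directory linked above):

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)
venv = make_vec_env("HalfCheetah-v3", n_envs=8, rng=rng)

# PPO generator; the values below are placeholders, not the tuned ones
# from the benchmarking configs.
learner = PPO(
    env=venv,
    policy=MlpPolicy,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=4e-4,
    gamma=0.95,
    n_epochs=5,
)
reward_net = BasicRewardNet(
    venv.observation_space,
    venv.action_space,
    normalize_input_layer=RunningNorm,
)
gail_trainer = GAIL(
    demonstrations=rollouts,  # the demonstrations collected in the script above
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)
gail_trainer.train(800_000)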

mertalbaba commented 1 year ago

The problem is that hyperparameters are only provided for some environments (no parameters for Humanoid) and only for limited combinations (only with PPO). Do you plan to extend the tuned parameters to cover Humanoid and SAC/TRPO + GAIL/AIRL?

AdamGleave commented 1 year ago

We don't currently have plans to extend them, but we'd welcome PRs on this. You can use the scripts in https://github.com/HumanCompatibleAI/imitation/pull/675 to do that.
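
For illustration only, a hyperparameter search over GAIL settings can also be sketched directly against the Python API with Optuna. This is not the interface of the tuning scripts in that PR, just the general shape of such a search; the search ranges are made up, and it reuses the rollouts demonstrations from the reproduction script above:

import numpy as np
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env


def objective(trial: optuna.Trial) -> float:
    # One GAIL training run per trial; hyperparameter ranges are illustrative.
    rng = np.random.default_rng(trial.number)
    venv = make_vec_env("HalfCheetah-v3", n_envs=8, rng=rng)
    learner = PPO(
        "MlpPolicy",
        venv,
        learning_rate=trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        ent_coef=trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True),
    )
    reward_net = BasicRewardNet(
        venv.observation_space,
        venv.action_space,
        normalize_input_layer=RunningNorm,
    )
    trainer = GAIL(
        demonstrations=rollouts,  # demonstrations collected earlier in this thread
        demo_batch_size=trial.suggest_categorical("demo_batch_size", [512, 1024, 2048]),
        gen_replay_buffer_capacity=2048,
        n_disc_updates_per_round=trial.suggest_int("n_disc_updates", 2, 8),
        venv=venv,
        gen_algo=learner,
        reward_net=reward_net,
    )
    trainer.train(200_000)
    mean_reward, _ = evaluate_policy(learner, venv, 20)
    return mean_reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)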

mertalbaba commented 1 year ago

Ok, thanks. I found hyperparameters for GAIL-SAC in all the Gym environments and will try to open a pull request for them this week.

Rowing0914 commented 9 months ago

@mertalbaba Hi! Thank you for raising this issue. Where can we find your params?