@ManifoldFR may I ask how you obtained the values for the model/policy hyperparams? Did you perform tuning using Optuna as in the RL zoo?
I started from the parameters of Jason Peng's code, but for things like the maximum grad norm, target KL or vf coef I had to make guesses because these were not parameters in his PPO implementation (also he had two separate optimizers for the policy and value functions).
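For reference, those knobs sit in stable-baselines3's PPO constructor roughly like this (the numbers below are placeholder guesses, not values from any of the runs discussed here, and the env is just a stand-in):
# Placeholder sketch only: the values are guesses, not tuned settings, and
# Pendulum-v1 stands in for the DeepMimic gym env.
import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # stand-in for the DeepMimic humanoid env

model = PPO(
    "MlpPolicy",
    env,
    max_grad_norm=0.5,  # not a parameter in Jason Peng's PPO, so guessed
    target_kl=None,     # likewise guessed; None disables KL-based early stopping
    vf_coef=0.5,        # one optimizer for both heads, so the value loss is weighted in
    verbose=1,
)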
Training sometimes gets stuck in such behavior. Did you try a couple of training runs?
What about the discount factor and lambda parameter for TD(lambda)? Also, are you using my branch with the modifications to the Gym env? Here's a Dropbox link with a policy trained with this (I think for that run I set a slightly higher learning rate).
@erwincoumans I tried a couple of runs using my script and another one using the training script from the stable-baselines3 zoo.
@ManifoldFR I used the default values for the discount factor and lambda parameter. Did you use custom values? I figured you also used the default ones, given that you didn't list them with the other params. I used the version with action/observation scaling, so I guess it's the same.
Sorry about that, I use a strategy where I have a default set of PPO params on top of SB3's defaults, and the values I gave you were the overrides for both of them. Check the hyperparams.yml in the Dropbox link I sent; I use the same discount and lambda (0.95) as Jason Peng. I think one of the important things was that I use 4096 timesteps per env per rollout.
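Roughly, the overrides described above correspond to the following (a reconstruction for illustration; the hyperparams.yml in the Dropbox link is the authoritative version):
# Reconstruction of the overrides described above; check hyperparams.yml in the
# Dropbox link for the actual values used.
ppo_overrides = dict(
    gamma=0.95,       # discount factor, same as Jason Peng's DeepMimic
    gae_lambda=0.95,  # lambda for TD(lambda) / GAE
    n_steps=4096,     # timesteps collected per env per rollout
)
# Everything not listed falls back to SB3's PPO defaults, e.g.:
# model = PPO("MlpPolicy", env, **ppo_overrides)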
Ah I see, no worries!
I was wondering whether I was doing something wrong in the training setup or when loading the model, but I figured there might be something wrong with the parameters, given that the training would get stuck.
Yes, the method is quite brittle, I'm afraid; some hyperparameters can send you to very bad local minima. Have you looked at other papers like Facebook's ScaDiver? The approach is the same, but the subreward aggregation/early termination strategies are different. Maybe it's more robust, but I haven't tested it yet.
I haven't read the paper, but I saw their repo and video; it seems very promising. I am trying to stick with DeepMimic because I don't want to change everything halfway :)
Also, if I recall correctly, they use a different format for clips (3d joints instead of quaternions maybe?), so I would have to adapt the tracking algorithm to that as well.
They use the more standard BVH format instead of the custom format used in DeepMimic, and they have code to convert it to character poses in reduced coordinates to supply to pybullet.
Btw, I couldn't help but notice that in deep_mimic_env.py you calculate the reward before applying the new action. Is that intentional?
reward = self._internal_env.calc_reward(agent_id)
# Apply control action
self._internal_env.set_action(agent_id, action)
I don't think it would actually make a huge difference, but it seemed a bit odd.
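For comparison, the "reward after action" ordering would look roughly like this (a sketch only: set_action and calc_reward are the calls from the snippet above, while update, record_state, is_episode_end and their arguments are assumptions about the internal env's API):
# Sketch of computing the reward after the action has been applied and the
# simulation advanced; method names other than set_action/calc_reward are
# assumptions about the wrapped env's interface.
def step_with_post_action_reward(internal_env, agent_id, action, timestep=1.0 / 240.0):
    internal_env.set_action(agent_id, action)    # apply the control action first
    internal_env.update(timestep)                # advance the physics simulation (assumed API)
    reward = internal_env.calc_reward(agent_id)  # reward now reflects the post-action state
    state = internal_env.record_state(agent_id)  # assumed API
    done = internal_env.is_episode_end()         # assumed API
    return state, reward, done, {}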
That's something I'm not 100% sure about. DeepMimic's interaction loop is pretty non-standard and it's hard to tell when the rewards are calculated: I think it's with respect to the current state s_t, before applying the action a_t (and getting to state s_{t+1}), rather than afterwards.
IMO either one works as long as you make sure the reference pose you're comparing the state to is the right one (same time step).
ScaDiver computes rewards w.r.t. the state at time t (using state data from before applying the action):
https://github.com/facebookresearch/ScaDiver/blob/96001537f9ab2eddfe871b78807923a30f7d012f/env_humanoid_base.py#L368-L385
I tried to train the character using the hyperparams given by @ManifoldFR in #3076.
However, after 60 million steps the character averages a reward of ~300-350, and when I test it, the character walks by always moving the same foot and then dragging the other one.
Here are my training and enjoy scripts:
train
enjoy
In deep_mimic_env.py I modified the action space by using a FakeBox class that inherits from gym.spaces.Box.
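Since the FakeBox itself isn't shown above, here is a purely illustrative sketch of subclassing gym.spaces.Box (the relaxed contains() override and the example shape are assumptions, not necessarily what the real class does):
# Illustrative guess only: demonstrates subclassing gym.spaces.Box with a
# contains() that accepts any correctly-shaped action regardless of the bounds.
import numpy as np
from gym import spaces

class FakeBox(spaces.Box):
    def contains(self, x):
        x = np.asarray(x, dtype=self.dtype)
        return x.shape == self.shape  # ignore the [low, high] bounds

# e.g. swap it into the env (the shape is just an example):
# self.action_space = FakeBox(low=-1.0, high=1.0, shape=(36,), dtype=np.float32)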