andrewliao11 / gail-tf

Tensorflow implementation of generative adversarial imitation learning
MIT License

How to define the reward function if I want to generate a policy for a specific task? #10

Closed jiameij closed 6 years ago

jiameij commented 6 years ago

hi~ Thanks for sharing the code, it helps me a lot! I have a question about the reward function. I want to train the expert for running with Humanoid-v2 using the TRPO algorithm. How can I define the reward function? Or is there another way to achieve this? Thank you very much!

andrewliao11 commented 6 years ago

I'm not sure if I understand your question. If you're using the Humanoid environment in openai/gym, they have already designed a suitable reward function for you (at least, that reward function can lead to some magic).

Here's their reward function for Humanoid [link]:

pos_before = mass_center(self.model, self.sim)  # center of mass before stepping
self.do_simulation(a, self.frame_skip)
pos_after = mass_center(self.model, self.sim)   # center of mass after stepping
alive_bonus = 5.0  # constant bonus for staying alive (not falling)
data = self.sim.data
# reward forward velocity of the center of mass
lin_vel_cost = 0.25 * (pos_after - pos_before) / self.model.opt.timestep
# penalize large actuator torques
quad_ctrl_cost = 0.1 * np.square(data.ctrl).sum()
# penalize large external contact forces, capped at 10
quad_impact_cost = .5e-6 * np.square(data.cfrc_ext).sum()
quad_impact_cost = min(quad_impact_cost, 10)
reward = lin_vel_cost - quad_ctrl_cost - quad_impact_cost + alive_bonus
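
If you want a different task, one common pattern (a minimal sketch, not from this repo; the wrapper name and the height-based reward are illustrative assumptions, and the exact Wrapper API depends on your gym version) is to wrap the environment and substitute your own reward:

import gym

class StandupReward(gym.Wrapper):
    # hypothetical wrapper: replace the default running reward with one
    # that favors torso height, i.e. standing up
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # qpos[2] is the torso z-coordinate in the MuJoCo humanoid model
        height_reward = self.env.unwrapped.sim.data.qpos[2]
        return obs, height_reward, done, info

env = StandupReward(gym.make("Humanoid-v2"))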
jiameij commented 6 years ago

Thanks for your reply! I mean: should we design different reward functions for different tasks (for example running, standing up, or walking)? From (https://drive.google.com/drive/folders/1h3H4AY_ZBx08hz-Ct0Nxxus-V1melu1U), I find there are two tasks for the humanoid: Humanoid and HumanoidStandup. Do you use the same reward function? And I run the expert policy with the following code:

import numpy as np
import gym
import time

env = gym.make("Humanoid-v2")
env.reset()
traj_data = np.load("/home/jmj/Downloads/deterministic.trpo.HumanoidStandup.0.00.npz")
acs = traj_data['acs']
for traj_acs in acs:
    for ac in traj_acs:
        env.step(ac)
        env.render()
        time.sleep(0.02)

But the humanoid can't stand up, do you know why? Thank you again!
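
For reference, here is a minimal replay sketch under the same assumptions (the .npz stores one action array per trajectory under the key 'acs'). Two things stand out in the snippet above: the trajectory file was recorded on HumanoidStandup, so the matching environment would be HumanoidStandup-v2 rather than Humanoid-v2, and open-loop replay of recorded actions generally diverges anyway, because env.reset() samples a random initial state that differs from the one the expert started from:

import time
import numpy as np
import gym

env = gym.make("HumanoidStandup-v2")  # match the env the trajectories came from
traj_data = np.load("/home/jmj/Downloads/deterministic.trpo.HumanoidStandup.0.00.npz")

for traj_acs in traj_data['acs']:
    env.reset()  # resets to a random state, not the expert's start state
    for ac in traj_acs:
        env.step(ac)  # open-loop replay: small state errors compound step by step
        env.render()
        time.sleep(0.02)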