andrewliao11 / gail-tf

Tensorflow implementation of generative adversarial imitation learning
MIT License

The difference in the 'stochastic' setting between 'traj_episode' and 'traj_segment' #6

Closed · LinBornRain closed this issue 6 years ago

LinBornRain commented 6 years ago

Hi, andrewliao11! It's me again... In 'trpo_mpi.learn()', the code sets 'stochastic' to 'True' for 'traj_segment', but in 'trpo_mpi.evaluate()' it uses the default 'stochastic=False' for 'traj_episode'. Should there be a difference in 'evaluate()'? It seems to produce a totally different trajectory performance when learning with a stochastic policy but evaluating with a deterministic one. THX!
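
(A minimal sketch of how the flag is typically threaded through a baselines-style rollout loop; 'pi.act(stochastic, ob)' and the loop below are illustrative assumptions, not verbatim code from this repo.)

```python
# Sketch: how `stochastic` usually changes a rollout, assuming a baselines-style
# policy with pi.act(stochastic, ob) -> (action, value). With stochastic=True the
# action is sampled from the policy distribution; with False the mean/mode is used.
def rollout(pi, env, horizon, stochastic):
    ob = env.reset()
    obs, acs, rews = [], [], []
    for _ in range(horizon):
        ac, _vpred = pi.act(stochastic, ob)  # True in learn() rollouts, False by default in evaluate()
        ob, rew, done, _ = env.step(ac)
        obs.append(ob); acs.append(ac); rews.append(rew)
        if done:
            ob = env.reset()
    return obs, acs, rews
```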

andrewliao11 commented 6 years ago

Hi @LinBornRain, I'm not sure I get your point. For convenience, I set the default of 'stochastic' in 'traj_episode' to False in order to evaluate the deterministic policy. Putting this flag into the arguments of 'evaluate' is a good idea. I will close the issue for now; feel free to reopen it.
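
(A hypothetical sketch of that suggestion: make the evaluation mode an explicit argument instead of relying on the generator's default. The function and argument names below are illustrative, not taken from the repo; only the 'pi.act(stochastic, ob)' interface from the sketch above is assumed.)

```python
import numpy as np

def evaluate(pi, env, num_episodes, stochastic=False):
    # The caller now decides stochastic vs. deterministic evaluation explicitly.
    returns = []
    for _ in range(num_episodes):
        ob, done, ep_ret = env.reset(), False, 0.0
        while not done:
            ac, _vpred = pi.act(stochastic, ob)
            ob, rew, done, _ = env.step(ac)
            ep_ret += rew
        returns.append(ep_ret)
    return np.mean(returns)
```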

LinBornRain commented 6 years ago

Hi, andrewliao11: When I try to reproduce the results with your code, I notice that GAIL-TRPO is trained with a stochastic policy in 'traj_segment', but the policy is evaluated in deterministic mode by default, as you said. I have also tried:

  1. Using a deterministic policy while training GAIL-TRPO; learning no longer works.
  2. Still training with a stochastic policy, but evaluating in stochastic mode; the stochastic policy evaluated in stochastic mode performs worse than when evaluated in deterministic mode.

So I am confused about the inconsistency of the default setting in evaluation.

andrewliao11 commented 6 years ago

Using a stochastic policy during training is for exploration, while using a deterministic policy for evaluation is for exploitation. Check the literature on exploration versus exploitation.
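
(A toy illustration of the flag's effect, assuming a diagonal-Gaussian policy head as in baselines-style MLP policies; names here are illustrative, not repo code.)

```python
import numpy as np

def act(mean, log_std, stochastic):
    if stochastic:
        # Training mode: sample around the mean -> exploration.
        return mean + np.exp(log_std) * np.random.randn(*mean.shape)
    # Evaluation mode: take the mean action -> exploit the learned policy.
    return mean

mean, log_std = np.array([0.3, -0.1]), np.array([-1.0, -1.0])
print(act(mean, log_std, stochastic=True))   # noisy action, as used during learn()
print(act(mean, log_std, stochastic=False))  # deterministic action, as used in evaluate()
```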

LinBornRain commented 6 years ago

That makes sense to me! THX a lot!