hi @LinBornRain ,
I'm not sure I get your point. For the sake of convenience, I set the default of 'stochastic' in 'traj_episode' to False so that the deterministic policy is evaluated.
Putting this flag into the argument of evaluate is a good idea.
I will close the issue for now; feel free to reopen it.
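
For reference, here is a minimal sketch of what exposing that flag in 'evaluate' might look like; the signature, the 'ep_ret' key, and the other parameters below are assumptions for illustration, not the repo's actual code.

```python
# Sketch only: assumes traj_episode_generator accepts a `stochastic` keyword,
# as discussed above; the remaining parameters here are hypothetical.
def evaluate(env, pi, timesteps_per_batch, number_trajs=10, stochastic=False):
    ep_gen = traj_episode_generator(pi, env, timesteps_per_batch,
                                    stochastic=stochastic)
    # average episode return over the requested number of trajectories
    returns = [next(ep_gen)["ep_ret"] for _ in range(number_trajs)]
    return sum(returns) / len(returns)
```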
Hi, andrewliao11: When I try to reproduce the results with your code, I notice that GAIL-TRPO is trained with a stochastic policy in 'traj_segment', but the policy is evaluated in deterministic mode by default, as you said. Also, I have tried using a deterministic policy while training GAIL-TRPO, and learning no longer works.
Using a stochastic policy during training is for exploration, while using a deterministic policy for evaluation is for exploitation. Check the literature on exploration versus exploitation.
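
To illustrate that distinction, here is a tiny sketch (not the repo's code) of a diagonal-Gaussian policy head: with stochastic=True it samples an action for exploration during training, and with stochastic=False it returns the mean action for deterministic evaluation.

```python
import numpy as np

def act(mean, log_std, stochastic=True):
    """mean/log_std are the policy's Gaussian parameters for one state."""
    if stochastic:
        # exploration: sample an action around the mean
        return mean + np.exp(log_std) * np.random.randn(*mean.shape)
    # exploitation: deterministic action, no sampling noise
    return mean
```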
That makes sense to me! THX a lot!
Hi, andrewliao11! It's me again... In 'trpo_mpi.learn()', the code sets 'stochastic' to True in 'traj_segment'; but in 'trpo_mpi.evaluate()', the code defaults 'stochastic' to False in 'traj_episode'. Should there be a difference in 'evaluate'? It seems to give very different trajectory performance when learning with a stochastic policy and evaluating with a deterministic policy. THX!
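
The asymmetry being asked about, paraphrased roughly (names taken from the thread, not the exact repo code): training rollouts in 'learn()' are collected with stochastic=True, while 'evaluate()' rolls out episodes with stochastic=False, so the returns reported in the two phases come from different action-selection modes.

```python
# Paraphrase of the asymmetry described above (generator names assumed):
seg_gen = traj_segment_generator(pi, env, timesteps_per_batch, stochastic=True)   # inside learn()
ep_gen = traj_episode_generator(pi, env, timesteps_per_batch, stochastic=False)   # inside evaluate()
```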