MorvanZhou / Reinforcement-learning-with-tensorflow

Simple reinforcement learning tutorials, Chinese AI tutorials by 莫烦Python
https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/
MIT License

Some questions about PPO #10

Closed 20chase closed 7 years ago

20chase commented 7 years ago

Hello Zhou,

Nice code! But I have some questions about your PPO part.

  1. Have you tested other continuous action-space environments such as MuJoCo? I tried the Simple PPO code on MuJoCo and it doesn't seem to work. On the other hand, the Pendulum-v0 result is good but not great: the best 100-episode average reward was still below -200.

  2. In the Simple PPO code you normalize the reward. I don't understand why reward normalization improves the result so much. That normalization is essentially hand-crafted feature engineering, which we should ideally avoid doing ourselves.

I am looking forward to your reply. :)

MorvanZhou commented 7 years ago

Hi @20chase,

> 1. Have you tested other continuous action-space environments such as MuJoCo? I tried the Simple PPO code on MuJoCo and it doesn't seem to work. On the other hand, the Pendulum-v0 result is good but not great: the best 100-episode average reward was still below -200.

I haven't tried MuJoCo yet. The simple_PPO code has a single-threaded structure, so the data a single agent collects is highly correlated, which is not good for training. Still, simple_PPO is a good example for demonstrating PPO's structure. A better result could be achieved by collecting data with multiple agents, as in A3C (my DPPO adopts this idea).
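
For reference, here is a minimal sketch of that multi-agent data-collection idea, assuming the classic gym API (`env.step` returning `(s_, r, done, info)`); the names `worker` and `DATA_QUEUE` are purely illustrative, not the DPPO code itself:

```python
# Sketch: several workers roll out their own environment copies and push
# transitions into a shared queue; mixing workers decorrelates the data
# that a single PPO learner would consume. Illustrative only.
import threading
import queue
import gym

DATA_QUEUE = queue.Queue()              # shared buffer for the learner

def worker(n_steps=200):
    env = gym.make('Pendulum-v0')       # each worker owns its own env copy
    s = env.reset()
    for _ in range(n_steps):
        a = env.action_space.sample()   # stand-in for the current policy
        s_, r, done, _ = env.step(a)
        DATA_QUEUE.put((s, a, r, s_))
        s = env.reset() if done else s_

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# A real learner would pull batches from DATA_QUEUE and run PPO updates
# while the workers are still collecting, instead of waiting for join().
print('collected transitions:', DATA_QUEUE.qsize())
```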

> 2. In the Simple PPO code you normalize the reward. I don't understand why reward normalization improves the result so much. That normalization is essentially hand-crafted feature engineering, which we should ideally avoid doing ourselves.

The normalization process is mentioned in Google's DPPO paper. The reason under the hood should be that they use the discounted future return rather than TD learning. In the Pendulum environment the reward is always negative; at best it approaches zero when the pendulum reaches the highest point. With the discount factor, the future return therefore only goes toward zero even if the pendulum keeps staying at the highest point. If instead the highest reward is a positive number, the future return can accumulate to a larger value, and I think that accumulation encourages learning.
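
The effect is easy to see numerically. Below is a minimal sketch (not the repository's exact code) that computes discounted future returns for a Pendulum-style reward sequence, once with the raw always-negative rewards and once after a shift/scale such as `(r + 8) / 8`, which maps Pendulum's reward range of roughly [-16.3, 0] into roughly [-1, 1]; the function `discounted_returns` and the sample rewards are purely illustrative:

```python
# Sketch: discounted future returns with and without a reward shift.
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Raw Pendulum-style rewards: the best case is 0 at the upright position,
# so the return can only approach zero, never grow.
raw = np.array([-1.0, -0.5, -0.1, 0.0, 0.0, 0.0])
print(discounted_returns(raw))       # stays <= 0 everywhere

# After shifting, staying upright yields positive rewards, so the return
# accumulates toward r / (1 - gamma) while the pendulum stays up.
shifted = (raw + 8.0) / 8.0
print(discounted_returns(shifted))   # positive and growing with time upright
```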