btaba / intro-to-rl

Coding examples for Intro to RL
MIT License

racetrack problem average return per episode keep vibrating #2

Closed · xubo92 closed this 7 years ago

xubo92 commented 7 years ago

Hi @btaba: I plotted both the average return per episode and the total return per episode over about 100 episodes, and both look like random noise. Is 100 episodes too few? Something seems off, and I'm confused. Could you test your training process by plotting the average or total return per episode over about 100 episodes? Does performance improve noticeably within the first few hundred episodes? Here is my result:

[Plot: racetrack_result_1]

btaba commented 7 years ago

Hi @lvlvlvlvlv, plotting training rewards per episode does produce a plot similar to yours. If, however, you plot greedy rewards per training episode, you do see the curve go up. In other words, instead of recording the return of an episode run with the stochastic policy, periodically run an episode with the deterministic greedy policy just to check whether the agent has learned the environment, then continue training as usual with the stochastic policy. The plot is reproduced in the notebook here.

[Plot: greedy rewards per training episode]
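The evaluation scheme described above can be sketched as follows. This is a toy stand-in, not the repo's code: the tiny chain environment and all names here are hypothetical, and one-step Q-learning is used in place of the repo's Monte Carlo control. The point is the split between training episodes (stochastic epsilon-greedy policy, with learning) and evaluation episodes (deterministic greedy policy, no learning), which produces a much smoother learning curve.

```python
import random

random.seed(0)

# Hypothetical chain environment standing in for the racetrack:
# states 0..GOAL on a line; reaching GOAL ends the episode.
N_STATES, GOAL = 4, 3

def step(state, action):
    """Action 0 moves left (floored at 0), action 1 moves right."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, -1.0, nxt == GOAL  # -1 per step rewards reaching GOAL fast

def greedy_action(Q, s):
    # Ties break toward "left" so the untrained agent starts out suboptimal.
    return 0 if Q[(s, 0)] >= Q[(s, 1)] else 1

def train_episode(Q, eps=0.3, alpha=0.5):
    """One learning episode with the stochastic epsilon-greedy policy."""
    s = 0
    for _ in range(100):  # step cap
        a = random.randrange(2) if random.random() < eps else greedy_action(Q, s)
        s2, r, done = step(s, a)
        # One-step Q-learning update (the repo uses Monte Carlo; same idea).
        target = r + (0.0 if done else max(Q[(s2, 0)], Q[(s2, 1)]))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            return

def greedy_return(Q):
    """One evaluation episode with the deterministic greedy policy, no learning."""
    s, total = 0, 0.0
    for _ in range(100):
        s, r, done = step(s, greedy_action(Q, s))
        total += r
        if done:
            break
    return total

Q = {(s, a): 0.0 for s in range(N_STATES) for a in (0, 1)}
curve = []
for episode in range(300):
    train_episode(Q)                 # learn with the stochastic policy
    curve.append(greedy_return(Q))   # evaluate greedily; plot this instead
```

Plotting `curve` rather than the per-episode training return is exactly the trick: the training return mixes in exploration noise, while the greedy evaluation isolates what the agent has actually learned.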

xubo92 commented 7 years ago

@btaba Thank you for your answer! I ran another test over about 2000 iterations, and performance clearly improves after roughly 500 iterations. The first several hundred iterations may indeed just be random fluctuation.