alessiodm / drl-zh

Deep Reinforcement Learning: Zero to Hero!
MIT License

Possible errors in 04_PG.ipynb? #5

Closed fancyfredbot closed 3 days ago

fancyfredbot commented 1 month ago

I may have misunderstood everything here, but I think perhaps the guidance/solution are wrong in the final stage of this notebook, specifically with the baseline and the normalisation.

Baseline

Normalisation

fancyfredbot commented 1 month ago

Another comment: I noticed that I was getting different results every time. I traced it to the initial weights of the policy network. For reproducibility you need to call init_random() before constructing the agent, or torch will give you a different random weight initialisation every time.
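In other words, something like this (PolicyGradientAgent and env are just placeholders for whatever the notebook actually names them):

init_random()                     # seed python/numpy/torch before any network weights are created
agent = PolicyGradientAgent(env)  # the policy network now starts from the same weights every run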

alessiodm commented 1 month ago

Thank you so much for opening this issue. Both your observations are exactly right, and the notebook is incorrect as currently implemented. The errors are not possible: they are certain.

Mean and stdev must be computed per timestep (of the future reward) across multiple trajectories, to capture the (likely shifting during training) distribution of rewards. The baseline as currently implemented is also pointless: it performs a (badly computed) mean subtraction that is itself incorrect.

Your suggested fix for normalization (e.g., use the last 5 trajectories) is very nice, and captures the concept correctly without requiring a vectorized environment (which I wanted to avoid at this stage of the notebooks).
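Something along these lines (just a sketch: it assumes the discounted future rewards of each trajectory are computed elsewhere, and the buffer of 5 mirrors your suggestion):

import numpy as np
from collections import deque

recent_future_rewards = deque(maxlen=5)  # future-reward arrays of the last 5 trajectories

def normalize(future_rewards):
    """Normalize each timestep's future reward using stats across recent trajectories."""
    recent_future_rewards.append(np.asarray(future_rewards, dtype=np.float32))
    normalized = np.zeros(len(future_rewards), dtype=np.float32)
    for t in range(len(future_rewards)):
        # Only trajectories long enough to reach timestep t contribute to the stats.
        values = np.array([fr[t] for fr in recent_future_rewards if len(fr) > t])
        normalized[t] = (future_rewards[t] - values.mean()) / (values.std() + 1e-8)
    return normalized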

Regarding the baseline, I was thinking about how to properly introduce it for the CartPole environment, but I am struggling to find a simple one that fits the bill. We need something that provides a better "advantage" signal than the raw normalized future rewards. Maybe somehow penalizing short trajectories (even if all rewards are positive), because there we are clearly pushing the pole in the wrong direction? Please, let me know if you have any ideas - we can even change the environment if needed.

I am going to open a PR this week with the normalization fixes and possibly a proper baseline; I'd be happy if you'd like to review it. Alternatively, if you'd like to contribute to the notebooks yourself and open a PR, just let me know and I can hold off :)

P.S.: I fixed the agent instantiation so that it happens after the init_random() call, and I also checked the other notebooks to make sure they do the same. Thanks once more!

fancyfredbot commented 1 month ago

I am also unsure about what makes a good baseline for this cartpole example. Sutton and Barto seem to suggest you can train a network to assign a value to each state and use that value as a baseline for the reward.

I think such a value function is going to say that states too near the edge of the track, or with the pole at a high angle, or moving fast, are low value. So I tried a baseline of the average score on that timestep over the last 10 runs, plus this hand-tuned function designed to grow (and so penalise more) when we're too far along the track or the pole leans too far:

baseline = lambda t, state: moving_avg_score[t] + (10 * state[2])**2 + (0.6 * state[0])**2 - 1

This seems marginally better than just using moving_avg_score[t]. But I'm not really sure this is at all illustrative of how a baseline "should" be done. Maybe it would be better to train a network?
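For clarity, this is roughly how I'm plugging it in (variable names are just from my own sketch, not the notebook):

# Per-timestep advantage used to weight the log-probabilities in the policy gradient.
# states[t] is the CartPole observation at timestep t, future_rewards[t] the return from t onwards.
advantages = [future_rewards[t] - baseline(t, states[t]) for t in range(len(states))]
# loss = -(log_probs * torch.tensor(advantages)).sum()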


Very happy to review a PR!

alessiodm commented 1 month ago

Yes, definitely training a neural network for the baseline would work (and it is what actor-critic methods effectively do). But at this stage of the notebooks I wanted to keep things simpler if possible.
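Just to make that alternative concrete, a learned baseline would look roughly like the sketch below (illustrative only, not what the notebook does; the 4 is CartPole's observation size):

import torch
import torch.nn as nn

# A small state-value network used as a baseline b(s) ~= V(s).
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def advantages_with_learned_baseline(states, future_rewards):
    """Fit V(s) to the observed future rewards, then return the advantages."""
    values = value_net(states).squeeze(-1)          # states: float tensor of shape (T, 4)
    value_loss = nn.functional.mse_loss(values, future_rewards)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
    return future_rewards - values.detach()         # advantage estimate per timestep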

I was looking into normalization for the CartPole example, and it turns out it hinders the performance instead of helping: the mean (across environments) is always 1.0 and the stdev always ~0.0, which affects learning negatively to the point that it doesn't work.

I am working to see if we can use a "well crafted" reward function to showcase all these topics, but it'll take me a little bit longer to open a PR (I am also improving the overall guidance of the notebook).

fancyfredbot commented 1 month ago

I was looking into normalization for the CartPole example, and it turns out it hinders the performance instead of helping

I found that normalisation can work, but you have to normalise the future reward. This means that episodes where the cartpole stays upright for longer than the recent average are given a positive reward, whereas episodes where the cartpole falls earlier than average are given more negative rewards. This does help learning, but adding a baseline was better still. I think the problem is that this normalisation will apply a negative reward to every action in a 450-step episode if the average is 475 steps. However, the first 400 steps are probably all fine and should not be given a negative reward.
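A toy illustration of that last point:

avg_len, episode_len = 475, 450
future_rewards = [episode_len - t for t in range(episode_len)]      # undiscounted future reward at each step
avg_future_rewards = [avg_len - t for t in range(episode_len)]      # recent-average future reward at each step
deltas = [fr - avg for fr, avg in zip(future_rewards, avg_future_rewards)]
# Every element of `deltas` is -25: all 450 actions look equally "bad", including the good early ones.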

Yes, definitely training a neural network for the baseline would work (and it is what actor-critic methods effectively do). But at this stage of the notebooks I wanted to keep things simpler if possible.

I agree that REINFORCE with a neural-network baseline is very similar to actor-critic, so it could be confusing to introduce that here.

alessiodm commented 1 month ago

I created PR #6, but I didn't find a way to assign it to someone other than myself for review :( so I merged it directly. Please feel free to point out anything you may see wrong in the new version and I'll be happy to tackle it.

As a TL;DR, I decided to introduce gym.vector.Env for REINFORCE_v2. Normalization is across independent trajectories for the same version of the policy, and using past trajectories as a shortcut (or simplification) would be slightly incorrect from a theoretical point of view (even if it might work well enough).
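For reference, the vectorized setup looks roughly like the sketch below (assuming the gymnasium-style API; the actual notebook code may differ in the details):

import gymnasium as gym

num_envs = 8  # trajectories collected in parallel, all with the same version of the policy
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(num_envs)])
observations, infos = envs.reset(seed=42)
actions = envs.action_space.sample()                  # shape: (num_envs,)
observations, rewards, terminations, truncations, infos = envs.step(actions)
# Each step advances all rollouts by one timestep, so mean/stdev of future rewards
# can be computed per timestep across the num_envs independent trajectories.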

In order to showcase the meaning and advantages of normalization, I had to tweak the reward function of Cart Pole so that it actually reflects the "quality" of the actions. That is a good segue to also showing how designing rewards is a difficult and tricky problem in RL.

Finally, I carefully designed a baseline (based on the Cart Pole state) that almost always beats normalization alone, and used 10 random seeds to create a graph showing that.

I believe the notebook is now good enough to be representative of basic policy gradient techniques, but I am very curious to hear your feedback @fancyfredbot before closing this issue. Thanks again!

alessiodm commented 3 days ago

I am resolving this issue for now, please feel free to reopen for any concerns.