alessiodm / drl-zh

Deep Reinforcement Learning: Zero to Hero!
MIT License

Possible errors in 04_PG.ipynb? #5

Closed fancyfredbot closed 3 days ago

fancyfredbot commented 1 month ago

I may have misunderstood everything here, but I think perhaps the guidance/solution are wrong in the final stage of this notebook, specifically with the baseline and the normalisation.

Baseline

Normalisation

fancyfredbot commented 1 month ago

Another comment: I noticed that I was getting different results every time. I traced it to the initial weights of the policy network. For reproducibility you need to call init_random() before constructing the agent, or torch will give you a different random weight initialisation every time.
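In other words, something like this (PolicyGradientAgent and env are just placeholders for whatever the notebook actually names them):

init_random()                     # seed python/numpy/torch before any network weights are created
agent = PolicyGradientAgent(env)  # the policy network now starts from the same weights every run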

alessiodm commented 1 month ago

Thank you so much for opening this issue. Both your observations are exactly right, and the notebook is incorrect as currently implemented. The errors are not possible: they are certain.

Mean and stdev must be computed per timestep (of the future reward) across multiple trajectories, to capture the (likely shifting during training) distribution of rewards. The baseline as currently implemented is also pointless: it performs a (badly computed) mean subtraction that is itself incorrect.

Your suggested fix for normalization (e.g., use the last 5 trajectories) is very nice, and captures the concept correctly without requiring a vectorized environment (which I wanted to avoid at this stage of the notebooks).
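Something along these lines (just a sketch: it assumes the discounted future rewards of each trajectory are computed elsewhere, and the buffer of 5 mirrors your suggestion):

import numpy as np
from collections import deque

recent_future_rewards = deque(maxlen=5)  # future-reward arrays of the last 5 trajectories

def normalize(future_rewards):
    """Normalize each timestep's future reward using stats across recent trajectories."""
    recent_future_rewards.append(np.asarray(future_rewards, dtype=np.float32))
    normalized = np.zeros(len(future_rewards), dtype=np.float32)
    for t in range(len(future_rewards)):
        # Only trajectories long enough to reach timestep t contribute to the stats.
        values = np.array([fr[t] for fr in recent_future_rewards if len(fr) > t])
        normalized[t] = (future_rewards[t] - values.mean()) / (values.std() + 1e-8)
    return normalized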

Regarding the baseline, I was thinking about how to properly introduce it for the CartPole environment, but I am struggling to find a simple one that fits the bill. We need something that provides a better "advantage" signal than the raw normalized future rewards. Maybe somehow penalizing short trajectories (even if all rewards are positive), because there we are clearly pushing the pole in the wrong direction? Please, let me know if you have any ideas - we can even change the environment if needed.

I am going to open a PR this week with the normalization fixes and possibly a proper baseline; I'd be happy if you'd like to review it. Alternatively, if you'd like to contribute to the notebooks yourself and open a PR, just let me know and I can hold off :)

P.S.: I fixed the agent instantiation so that it happens after the init_random() call, and I also checked the other notebooks to make sure they do the same. Thanks once more!

fancyfredbot commented 1 month ago

I am also unsure about what makes a good baseline for this cartpole example. Sutton and Barto seem to suggest you can train a network to assign a value to each state and use that value as a baseline for the reward.

I think such a value function is going to say that states too near the edge of the track, or with the pole at a high angle, or moving fast, are low value. So I tried a baseline of the average score on that timestep over the last 10 runs, plus this hand-tuned function designed to grow (and so penalise more) when we're too far along the track or the pole leans too far:

baseline = lambda t, state: moving_avg_score[t] + (10 * state[2])**2 + (0.6 * state[0])**2 - 1

This seems marginally better than just using moving_avg_score[t]. But I'm not really sure this is at all illustrative of how a baseline "should" be done. Maybe it would be better to train a network?
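For clarity, this is roughly how I'm plugging it in (variable names are just from my own sketch, not the notebook):

# Per-timestep advantage used to weight the log-probabilities in the policy gradient.
# states[t] is the CartPole observation at timestep t, future_rewards[t] the return from t onwards.
advantages = [future_rewards[t] - baseline(t, states[t]) for t in range(len(states))]
# loss = -(log_probs * torch.tensor(advantages)).sum()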


Very happy to review a PR!

alessiodm commented 1 month ago

Yes, definitely training a neural network for the baseline would work (and it is what actor-critic methods effectively do). But at this stage of the notebooks I wanted to keep things simpler if possible.
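Just to make that alternative concrete, a learned baseline would look roughly like the sketch below (illustrative only, not what the notebook does; the 4 is CartPole's observation size):

import torch
import torch.nn as nn

# A small state-value network used as a baseline b(s) ~= V(s).
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def advantages_with_learned_baseline(states, future_rewards):
    """Fit V(s) to the observed future rewards, then return the advantages."""
    values = value_net(states).squeeze(-1)          # states: float tensor of shape (T, 4)
    value_loss = nn.functional.mse_loss(values, future_rewards)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
    return future_rewards - values.detach()         # advantage estimate per timestep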

I was looking into normalization for the CartPole example, and it turns out it hinders the performance instead of helping: the mean (across environments) is always 1.0 and the stdev always ~0.0, which affects learning negatively to the point that it doesn't work.

I am working to see if we can use a "well crafted" reward function to showcase all these topics, but it'll take me a little bit longer to open a PR (I am also improving the overall guidance of the notebook).

fancyfredbot commented 1 month ago

I was looking into normalization for the CartPole example, and it turns out it hinders the performance instead of helping

I found that normalisation can work, but you have to normalise the future reward. This means that episodes where the cartpole stays upright for longer than the recent average are given a positive reward, whereas episodes where the cartpole falls earlier than average are given more negative rewards. This does help learning, but adding a baseline was better still. I think the problem is that this normalisation will apply a negative reward to every action in a 450-step episode if the average is 475 steps. However, the first 400 steps are probably all fine and should not be given a negative reward.
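A toy illustration of that last point:

avg_len, episode_len = 475, 450
future_rewards = [episode_len - t for t in range(episode_len)]      # undiscounted future reward at each step
avg_future_rewards = [avg_len - t for t in range(episode_len)]      # recent-average future reward at each step
deltas = [fr - avg for fr, avg in zip(future_rewards, avg_future_rewards)]
# Every element of `deltas` is -25: all 450 actions look equally "bad", including the good early ones.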

Yes, definitely training a neural network for the baseline would work (and it is what actor-critic methods effectively do). But at this stage of the notebooks I wanted to keep things simpler if possible.

I agree that REINFORCE with a neural-network baseline is very similar to actor-critic, so it could be confusing to introduce that here.

alessiodm commented 1 month ago

I created PR #6, but I didn't find a way to assign it to someone other than myself for review :( so I merged it directly. Please feel free to point out anything you may see wrong in the new version and I'll be happy to tackle it.

As a TL;DR, I decided to introduce gym.vector.Env for REINFORCE_v2. Normalization is across independent trajectories for the same version of the policy, and using past trajectories as a shortcut (or simplification) would be slightly incorrect from a theoretical point of view (even if it might work well enough).
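For reference, the vectorized setup looks roughly like the sketch below (assuming the gymnasium-style API; the actual notebook code may differ in the details):

import gymnasium as gym

num_envs = 8  # trajectories collected in parallel, all with the same version of the policy
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(num_envs)])
observations, infos = envs.reset(seed=42)
actions = envs.action_space.sample()                  # shape: (num_envs,)
observations, rewards, terminations, truncations, infos = envs.step(actions)
# Each step advances all rollouts by one timestep, so mean/stdev of future rewards
# can be computed per timestep across the num_envs independent trajectories.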

In order to showcase the meaning and advantages of normalization, I had to tweak the reward function of Cart Pole so that it actually reflects the "quality" of the actions. That is a good segue to also showing how designing rewards is a difficult and tricky problem in RL.

Finally, I carefully designed a baseline (based on the Cart Pole state) that almost always beats normalization alone, and used 10 random seeds to create a graph showing that.

I believe the notebook is now good enough to be representative of basic policy gradient techniques, but I am very curious to hear your feedback @fancyfredbot before closing this issue. Thanks again!

alessiodm commented 3 days ago

I am resolving this issue for now, please feel free to reopen for any concerns.