The current policy gradient notebook has critical implementation errors, in particular in the normalization and baseline steps. See issue #5 for context.
This PR updates the Policy Gradients notebook with a proper normalization and baseline implementation across multiple trajectories, collected via a Gymnasium vector environment.
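The core of the fix can be sketched as follows: returns are computed per trajectory, but normalization is applied across the whole batch of trajectories rather than per trajectory. This is an illustrative NumPy sketch (with made-up reward sequences), not the notebook's exact code:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return G_t for one trajectory (list of per-step rewards)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

# Three hypothetical trajectories, as collected from a batch of envs.
trajectories = [[1.0, 1.0, 1.0], [1.0, 0.5], [0.2, 0.2, 0.2, 0.2]]
returns = np.concatenate([discounted_returns(tr) for tr in trajectories])

# Normalize across ALL trajectories in the batch, not within each one:
# subtracting the batch mean acts as a constant baseline, and dividing by
# the std rescales the gradient without changing its expectation.
advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
```

Normalizing within a single trajectory would discard the information that some trajectories are better than others, which is exactly the signal the policy gradient needs.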
To achieve good results (and to serve the learning goals of the notebook), the CartPole reward function is tweaked to better reflect the quality of an action, via the quality of the next state it leads to. This makes it possible to demonstrate how normalization helps, and how a carefully designed baseline can improve on top of it.
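As a rough illustration of that idea, a shaped reward can score the next observation instead of paying CartPole's flat +1 per step. The function below is a hypothetical shaping (the thresholds match CartPole's termination limits, but the scoring itself is invented for illustration), not necessarily the one used in the notebook:

```python
import numpy as np

def shaped_reward(next_obs, x_threshold=2.4, theta_threshold=0.2095):
    """Illustrative shaped reward: score the NEXT state's quality.

    Hypothetical shaping for demonstration, not CartPole's default reward.
    """
    x, _, theta, _ = next_obs
    # Each score is 1 when the cart is centered / pole is upright,
    # decaying to 0 as the state approaches the termination limits.
    cart_score = 1.0 - abs(x) / x_threshold
    pole_score = 1.0 - abs(theta) / theta_threshold
    return float(np.clip(0.5 * (cart_score + pole_score), 0.0, 1.0))

# A near-upright state scores higher than one close to failure:
good = shaped_reward([0.0, 0.0, 0.01, 0.0])
bad = shaped_reward([2.0, 0.0, 0.18, 0.0])
```

With the default +1-per-step reward, every action in a trajectory looks equally good; a state-dependent reward like this gives the gradient estimator something to discriminate on.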
The suggested baseline has been chosen carefully to show that a baseline can indeed improve learning performance. A remark about reward design, and how critical designing reward functions is in RL, has been added to the notebook, along with expanded theory sections.
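One common choice of a "carefully designed" baseline (shown here as a sketch, under the assumption of equal-length trajectories; the notebook's baseline may differ) is the time-dependent mean return across the batch:

```python
import numpy as np

# Hypothetical per-timestep returns for 3 trajectories of length 3.
returns = np.array([[3.0, 2.0, 1.0],
                    [5.0, 3.0, 1.5],
                    [4.0, 2.5, 1.2]])

# Time-dependent baseline: the mean return at each timestep across the
# batch. Because the baseline does not depend on the action taken, the
# policy gradient stays unbiased while its variance is reduced.
baseline = returns.mean(axis=0)   # shape (T,)
advantages = returns - baseline   # centered per timestep
```

Unlike a single scalar baseline, this accounts for the fact that returns shrink toward the end of an episode, so late actions are not unfairly penalized.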