Sycor4x closed this 5 years ago
Thanks for pointing this out! Will fix and commit.
Have you noticed any effect on convergence dynamics? For diverse batch samples, the wrong advantage baseline could effectively cancel out the dueling effect.
You're welcome. I haven't extensively tested and compared the two versions.
https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/a307a952bb914a1b5a43b7c92b7237ca28d88d28/Chapter08/lib/models.py#L56
This is a minor hiccup -- the line currently takes the mean across the entire `batch_size * n_actions` tensor, so `adv.mean()` is just a scalar. Instead, each observation in the mini-batch should have its own mean subtracted off, so it should be `out = val + adv - adv.mean(dim=1, keepdim=True)`, where `adv.mean(dim=1, keepdim=True)` has `batch_size` elements.
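A minimal sketch of the difference, using stand-in tensors for the two streams of the dueling head (the names `val` and `adv` follow the linked code; the shapes here are illustrative):

```python
import torch

# Hypothetical dueling-head outputs: batch of 4 observations, 6 actions.
batch_size, n_actions = 4, 6
val = torch.randn(batch_size, 1)          # state-value stream V(s)
adv = torch.randn(batch_size, n_actions)  # advantage stream A(s, a)

# Buggy version: adv.mean() collapses the whole batch_size x n_actions
# tensor to a single scalar, so every sample shares one global baseline.
buggy = val + adv - adv.mean()

# Fixed version: subtract each observation's own mean advantage; the
# (batch_size, 1) baseline broadcasts across that row's actions only.
fixed = val + adv - adv.mean(dim=1, keepdim=True)

# With the fix, the advantages are centred per observation: each row's
# mean advantage contribution to the Q-values is zero.
row_means = (fixed - val).mean(dim=1)
print(torch.allclose(row_means, torch.zeros(batch_size), atol=1e-6))  # True
```

The per-row centring is what makes the decomposition identifiable: without it, a constant can shift freely between the value and advantage streams within each sample.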