PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt
MIT License

Chapter 18: difference from original paper #16

Closed ryuoujisinta closed 5 years ago

ryuoujisinta commented 5 years ago

When I read the original paper by D. Silver et al., I came across two differences between the paper and your code.

  1. The original paper uses L2 regularization, but you don't (train.py, line 64).
  2. Your code retains all subtrees during an episode, but the original paper says that only the subtree under the selected action is retained.

If you have reasons for these choices, please let us know.

Shmuma commented 5 years ago

Hi!

  1. The original paper uses L2 regularization, but you don't (train.py, line 64).

The main reason is that I missed this point in the paper :). So, it could be a good experiment to check the effect of L2 regularization on the final result.
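For anyone who wants to try that experiment: in PyTorch, the paper's L2 penalty (the `c * ||theta||^2` term in the AlphaGo Zero loss) can be approximated by passing `weight_decay` to the optimizer. A minimal sketch, assuming an SGD optimizer; the tiny `nn.Linear` here is just a stand-in for the book's network, and `c = 1e-4` is the value reported in the paper:

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for the book's network; the real model comes from the chapter's code.
net = nn.Linear(4, 2)

# weight_decay adds an L2 penalty on the parameters at each update step,
# approximating the paper's c * ||theta||^2 regularization term (c = 1e-4).
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```

One subtlety of this approach: `weight_decay` applies the penalty to *all* parameters, including biases, whereas a hand-written loss term lets you exclude them; for this experiment the difference is likely negligible.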

  2. Your code retains all subtrees during an episode, but the original paper says that only the subtree under the selected action is retained.

Hm, that's interesting. I remember this part of the paper (it's at the bottom of page 26), and from my perspective it contradicts itself, as it says "The search tree is reused at subsequent time-step", but then "while the remainder of the tree is discarded".

From my perspective, discarding parts of the tree (even the parts that haven't been used in this search) is harmful to the MCTS statistics, so I decided to keep all the nodes and prune the tree only when a new best policy is elected. Probably I'm wrong (or just didn't understand this part of the paper).
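For comparison, the paper-style behaviour is easy to sketch. This is not the book's implementation (which keeps MCTS statistics in dictionaries keyed by state); it is a hypothetical node-based tree, just to make the two policies concrete:

```python
class Node:
    """Minimal illustrative MCTS node: per-action children plus a visit counter."""
    def __init__(self):
        self.children = {}   # action -> Node
        self.visit_count = 0


def advance_root(root, action):
    """Paper-style tree reuse (bottom of page 26): after playing `action`,
    keep only the subtree rooted at that child and discard everything else.
    If the action was never expanded, start from a fresh root."""
    child = root.children.get(action)
    return child if child is not None else Node()
```

Under this policy, statistics for the siblings of the played action are thrown away after every move; the book's approach instead keeps them for the rest of the episode, which is exactly the trade-off discussed above.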