awni opened 5 years ago
Hello Gabriel, thanks for taking the time to review our article and offer some suggestions. We believe they are really going to help the article! Below you will find detailed responses to some of the issues you raised.
“...curly left arrow (↩) may be misleading...”
“...somewhat inconsistent...presentation of discount factors…” and “...a sentence explaining early on why we would wish to discount future rewards…” and “...label and point to gamma as the discount factor?...”
“...there is an outer expectation missing in your explanation…”
“...I don't think value-based TD can have a claim on ‘most approaches to deep RL’”
“...Gridworld Playground, I got a ‘maximum exploit’”
“...a follow-up on TD(lambda)”
Thanks again for reviewing our article. We hope these responses and changes address your concerns and improve the article.
The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to Gabriel Synnaeve for taking the time to review this article.
General Comments
Before all this (hopefully constructive) criticism, let me say that the article presentation is awesome.
I think the curly left arrow (↩) may be misleading and would be better replaced with a "+=" sign, or by writing that it's a delta/a gradient/an update. Throughout the article it consistently means "updated with", but a left arrow often means "assign".
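To make the suggested "+=" reading concrete, here is a minimal sketch of a TD(0) value update (the function and variable names are assumptions for illustration, not taken from the article's code):

```javascript
// Sketch of a TD(0) value update, illustrating the "+=" reading of
// the update arrow: V(s) is incremented toward the bootstrapped
// target r + gamma * V(s'), not overwritten by it.
function tdUpdate(V, s, r, sPrime, alpha, gamma) {
  const tdError = r + gamma * V[sPrime] - V[s];
  V[s] += alpha * tdError; // "updated with", not plain assignment
  return V[s];
}

// Example: V(s0) moves a fraction alpha of the way toward the target.
const V = { s0: 0, s1: 1 };
tdUpdate(V, "s0", 0, "s1", 0.5, 0.9); // V.s0 becomes 0.5 * (0 + 0.9*1 - 0) = 0.45
```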
There is a somewhat inconsistent (not incorrect!) treatment/presentation of discount factors.
"A pleasant correspondence has emerged." -> there is an outer expectation missing in your explanation, to really make clear that in your example you average over both trajectories, and that is how $V_{\text{MC}}(s=(2,3)) = \frac{1}{N} \sum_{i=0}^{N-1} R_i$ with $N=2$ (coordinates: x left to right, y bottom to top). I get that it may complicate the message too much at this point; maybe this averaging can be shown under the first cliff world figure of the introduction.
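The outer averaging over sampled trajectories could be sketched like this (the function name and the two example returns are hypothetical, standing in for the returns observed from state (2,3)):

```javascript
// Sketch of the Monte Carlo value estimate: the average of the
// returns R_i observed over N sampled trajectories from a state.
function monteCarloValue(returnsFromState) {
  const N = returnsFromState.length;
  return returnsFromState.reduce((sum, Ri) => sum + Ri, 0) / N;
}

// With two trajectories (N = 2), the estimate averages over both:
monteCarloValue([1, 0]); // (1 + 0) / 2 = 0.5
```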
"Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo. Indeed, most approaches to deep RL use TD-style value updates. That said, a lot of work goes into making these methods more stable." -> I would not be that unequivocal: AlphaGo's and AlphaGo Zero's value models are trained on the game outcome (yes, they're forwarded/n-stepped in time through the tree search, but it's a Monte Carlo estimate). There have been enough successes of (pure) policy-based training and even zero-order methods that I don't think value-based TD can have a claim on "most approaches to deep RL".
In the final Gridworld Playground, I got a "maximum exploit" (slider all the way to the left) Monte Carlo agent to get stuck (e.g. in cell (0, 2) trying to go left, or in a loop between (0, 0) and (0, 1)). It seemed like there was no randomness in selecting the max when two or more actions are maximal at the same value. I checked your code and Playground.js:greedy_as(s) looks correct, so I debugged, and it seems to be because you do not initialize the Q values all to the same value (e.g. 0): one of them is randomly the max, so in greedy_as(s) you will always have only one element in the optimal_as array. A slider with gamma here would also make sense. BTW, you could link to the JavaScript code under the figures or at the end of the article; most of it is clear enough.

From the current content and presentation, it would be very easy to extend to do a follow-up on TD(lambda).
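A minimal sketch of the fix being suggested (Q values zero-initialized, and all actions within a tolerance of the max collected before breaking ties at random; greedy_as/optimal_as mirror the names in Playground.js, the rest are assumptions, not the article's actual code):

```javascript
// Sketch of tie-aware greedy action selection: with Q initialized to
// the same value everywhere (here 0), all actions tie at first, and
// selecting uniformly among the maximal ones keeps the agent from
// deterministically repeating one arbitrary action and getting stuck.
function greedyAs(Q, s, actions, eps = 1e-9) {
  const qMax = Math.max(...actions.map((a) => Q[s][a]));
  // Collect every action within eps of the max, not just the first.
  const optimalAs = actions.filter((a) => Q[s][a] >= qMax - eps);
  // Break ties uniformly at random among the maximal actions.
  return optimalAs[Math.floor(Math.random() * optimalAs.length)];
}

// All-zero initialization: every action is optimal, so even a purely
// greedy ("maximum exploit") agent varies its choice.
const Q = { s0: { up: 0, down: 0, left: 0, right: 0 } };
const a = greedyAs(Q, "s0", ["up", "down", "left", "right"]);
```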
Overall the article is very clear and explains powerful concepts of reinforcement learning.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest

What type of contributions does this article make?: Explanation of existing results