distillpub / post--td-paths

The Paths Perspective on Value Learning
https://distill.pub/2019/paths-perspective-on-value-learning/
Creative Commons Attribution 4.0 International

Review #2 #5


awni commented 5 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Erich Elsen for taking the time to review this article.


General Comments

The article mentions Go as a canonical example of RL success, but AlphaGo relies on Monte Carlo Tree Search, while the article makes it seem like Monte Carlo is hardly ever the right thing to use. These aren't exactly the same thing, but the distinction might be lost on readers who lock onto the "Monte Carlo" bit. Explaining all of AlphaGo is clearly too much for this article, but something that clarifies the point for the lay reader who notices "Monte Carlo" in both places and wonders what is going on might be helpful.

I find Example 6.4 of the 1st edition of Sutton & Barto to also provide useful intuition on the differences between Monte Carlo and TD.
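
For readers who want that contrast spelled out, here is a minimal tabular sketch (not the article's code; the states, rewards, and hyperparameters are made up) of the two update rules applied to a single episode:

```python
# Toy sketch contrasting TD(0) and Monte Carlo value updates on one episode.
gamma, alpha = 1.0, 0.1
states = ["A", "B", "C", "goal"]
V_td = {s: 0.0 for s in states}
V_mc = {s: 0.0 for s in states}

# One episode: (state, reward, next_state) transitions.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "goal")]

# TD(0): each state is updated from the next reward plus the *current
# estimate* of the next state, so estimates bootstrap off one another.
for s, r, s_next in episode:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# Monte Carlo: wait until the episode ends, then move each state's
# estimate toward the full observed return from that state onward.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])
```

The difference in where the information comes from (a bootstrapped neighbor estimate versus the full sampled return) is exactly what the Sutton & Barto example illustrates.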

Minor nit: Falling off the cliff usually has a much more negative reward, -10?

I would find it more natural to write the expansion of the value function in terms of the reward at (t+1) and the discounted V at (t+2) instead of at t and (t+1). I understand the motivation was to allow it to be plugged directly into the next equation, but it requires generalization on the next line in any case.
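
For reference, a sketch of the one-step expansion and its unrolled form under the Sutton & Barto indexing convention, where $R_{t+1}$ denotes the reward received after leaving $S_t$ (this may not match the article's notation exactly):

$$V(s_t) \;=\; \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s_t\right] \;=\; \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \mid S_t = s_t\right]$$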

The diagram where you can change the reward of the “blue” square is a bit confusing. There are really two shades of blue in the diagram at the default settings (although one is “more” blue than the other; English lets us down here, Russian wouldn’t have this problem). The second confusing bit is that the default value on the slider is 0.5, but the value shown in the diagram is always 2. These two things combined left me slightly confused at first, until I realized I was changing the value of going to the “blue” circle.

The section on double Q-learning makes me think that learning distributions rather than just means would provide more information, since one could then estimate the variance of paths in addition to their expectations. I feel like this must already have been done. It is probably worth a mention, since if I thought of it immediately, others likely will as well.
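
For reference, a minimal sketch of the tabular double Q-learning update (van Hasselt, 2010); the action set, state representation, and hyperparameters here are illustrative, not taken from the article:

```python
import random
from collections import defaultdict

# Illustrative hyperparameters and action set (not from the article).
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["up", "down", "left", "right"]
Q1 = defaultdict(float)  # keyed by (state, action)
Q2 = defaultdict(float)

def epsilon_greedy(state):
    # Act on the combined estimate; explore with probability epsilon.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])

def double_q_update(state, action, reward, next_state):
    # Randomly pick which table to update; the *other* table evaluates
    # the greedy action, decoupling action selection from evaluation
    # and reducing the overestimation bias of vanilla Q-learning.
    if random.random() < 0.5:
        a_star = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, a_star)]
        Q1[(state, action)] += alpha * (target - Q1[(state, action)])
    else:
        a_star = max(actions, key=lambda a: Q2[(next_state, a)])
        target = reward + gamma * Q1[(next_state, a_star)]
        Q2[(state, action)] += alpha * (target - Q2[(state, action)])
```

The key design point is that the table being updated never evaluates its own greedy action, which is what keeps the two estimates from reinforcing each other's optimistic errors.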

It would be nice if there were a way to speed up the gridworld playground at the end. Agents can wander around for quite some time before hitting the goal or the cliff, and collecting enough experience to converge is a bit tedious. Maybe just add a button that says “Add 5 agents”. Or, to give an idea of how efficient the different algorithms are, show the amount of experience collected so far. And allow the experience to be reset?


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue
How significant are these contributions? 4/5

Outstanding Communication
Article Structure 5/5
Writing Style 5/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 4/5
Readability 5/5

Scientific Correctness & Integrity
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 5/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5

greydanus commented 5 years ago

Hi Erich. First of all, thanks for reading our article and sending us some well-thought-out comments and suggestions. It’s great to get some outside perspective on what parts of the article need work/re-wording. In response to your review, we’ve made several changes and also addressed some of your concerns point-by-point:

“...Go as canonical example of RL success, but it uses Monte Carlo Tree Search…”

“...write the expansion of the Value function into the reward at (t+1) and the discounted V at (t+2) instead at t and t+1.”

“...The diagram where you can change the reward of the “blue” square is a bit confusing”

“...learning distributions rather than just means would provide more information”

Thanks again for reviewing our article. We hope these responses and changes address your concerns and improve the article!