distillpub / post--td-paths

The Paths Perspective on Value Learning
https://distill.pub/2019/paths-perspective-on-value-learning/
Creative Commons Attribution 4.0 International

Review #1 (#4)


awni commented 5 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Gabriel Synnaeve for taking the time to review this article.


General Comments

Before all this (hopefully constructive) criticism, let me say that the article presentation is awesome.

I think the curly left arrow (↩) may be misleading and would be better replaced with a "+=" sign, or with a note that it denotes a delta/a gradient/an update. Throughout the article it consistently means "updated with", but a left arrow often means "assign".
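For concreteness, here are the two readings side by side, written as the standard one-step TD(0) update (α is the step size, γ the discount factor); this is textbook notation, not necessarily the article's exact formula:

```latex
% Standard tabular TD(0) update with the "assign" reading of the arrow:
\[ V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_t + \gamma\, V(s_{t+1}) - V(s_t) \bigr] \]
% and with the "+=" (updated-with) reading suggested above:
\[ V(s_t) \mathrel{+}= \alpha \bigl[ r_t + \gamma\, V(s_{t+1}) - V(s_t) \bigr] \]
```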

There is a somewhat inconsistent (not incorrect!) treatment/presentation of discount factors.

"A pleasant correspondence has emerged." -> there is an outer expectation missing in your explannation, to really make clear that in your example you average over both trajectories and that how $V{MC}(s=(2,3)) = 1/(N=2) sum{i=0}^{N-1} r_i$ (coordinates: x left to right, y bottom to top). I get that it may complicate the message too much at this point, maybe this averaging can be shown under the first cliff world figure of the introduction.

"Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo. Indeed, most approaches to deep RL use TD-style value updates. That said, a lot of work goes into making these methods more stable." -> I would not be that unequivocal: AlphaGo's and AG Zero's value models are trained on the game outcome (yes they're forwarded/n-stepped in time through the tree search, but it's a Monte Carlo). There have been enough successes of (pure) policy-based training and even zero order methods, that I don't think value-based TD can have a claim on "most approaches to deep RL".

In the final Gridworld Playground, I got a "maximum exploit" (slider all the way to the left) Monte Carlo agent to get stuck (e.g. in cell (0, 2) trying to go left, or in a loop between (0, 0) and (0, 1)). It seemed like there was no randomness when selecting the max if two or more actions share the maximum value, but I checked your code and Playground.js:greedy_as(s) looks correct. After debugging, it seems to be because the Q values are not all initialized to the same value (e.g. 0), so one of them is randomly the max and greedy_as(s) always ends up with a single element in the optimal_as array (see the sketch below). A slider for gamma here would also make sense. BTW, you could link to the JavaScript code under the figures or at the end of the article; most of it is clear enough.
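For reference, a minimal sketch of the behavior being described: zero-initialized Q values plus random tie-breaking in the greedy selection. This is illustrative code only, not the article's actual Playground.js implementation; the names loosely mirror those mentioned above.

```javascript
// Illustrative sketch: greedy action selection with random tie-breaking.
// If all Q values start at the same value (e.g. 0), optimal_as contains
// every action at first and the agent can still move around; if they start
// at distinct random values, optimal_as has a single element and a
// "maximum exploit" agent can get stuck.

const ACTIONS = ['up', 'down', 'left', 'right'];

// Initialize Q(s, a) = 0 for every action so that ties are possible.
function init_q() {
  const q = {};
  for (const a of ACTIONS) q[a] = 0;
  return q;
}

// Return a greedy action for one state's Q values, breaking ties uniformly.
function greedy_a(q_s) {
  const q_max = Math.max(...ACTIONS.map(a => q_s[a]));
  const optimal_as = ACTIONS.filter(a => q_s[a] === q_max);
  return optimal_as[Math.floor(Math.random() * optimal_as.length)];
}

// With zero initialization, every action is a candidate at the start.
console.log(greedy_a(init_q()));  // random choice among all four actions
```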

From the current content and presentation, it would be very easy to extend this into a follow-up on TD(λ).

Overall the article is very clear and explains powerful concepts of reinforcement learning.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue | Score
--- | ---
How significant are these contributions? | 4/5

Outstanding Communication | Score
--- | ---
Article Structure | 3/5
Writing Style | 4/5
Diagram & Interface Style | 4/5
Impact of diagrams / interfaces / tools for thought? | 5/5
Readability | 4/5

Scientific Correctness & Integrity | Score
--- | ---
Are claims in the article well supported? | 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? | 4/5
How easy would it be to replicate (or falsify) the results? | 5/5
Does the article cite relevant work? | 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? | 4/5
greydanus commented 4 years ago

Hello Gabriel, thanks for taking the time to review our article and offer some suggestions. We believe they are really going to help the article! Below you will find detailed responses to some of the issues you raised.

“...curly left arrow (↩) may be misleading...”

“...somewhat inconsistent...presentation of discount factors…” and “...a sentence explaining early on why we would wish to discount future rewards…” and “...label and point to gamma as the discount factor?...”

“...there is an outer expectation missing in your explanation…”

“...I don't think value-based TD can have a claim on ‘most approaches to deep RL’”

“...Gridworld Playground, I got a ‘maximum exploit’”

“...a follow-up on TD(lambda)”

Thanks again for reviewing our article. We hope these responses and changes address your concerns and improve the article.