awni opened this issue 5 years ago
Hello! First of all, thanks for reading our article and sending us some thoughtful suggestions. In response to your review, we’ve made several changes. Here’s an overall summary, along with some general discussion of our thought process:
- “...The ‘updates toward’ operator is not precise in some sense... put such formulas in the footnote”
- “...explain the design of diagrams for value function, ...”
- “averaging over paths” vs. “over all possible paths”
- “Deep RL” vs. “RL”
The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to review this article.
General Comments
The article presents an illustrative way to understand various temporal difference learning algorithms. In general, it is rather helpful.
Several concerns. 1) The “updates toward” operator is not precise in some sense, especially for people who know the exact formulas. It may be helpful to put such formulas in a footnote. 2) The diagrams are sometimes confusing; e.g., the two diagrams for SARSA and Expected SARSA are the same. 3) It would be helpful to explain the design of the diagrams for the value function, Q value function, and policy/action selection, such as the meaning of the different colors and color patterns, as in the illustration of the five algorithms. (This is related to the last point.)
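For readers following along, here is a minimal sketch of the exact one-step updates that the “updates toward” arrows stand in for, written in standard Sutton & Barto notation (α is the step size and γ the discount factor; this is the textbook form, not necessarily the exact form the authors chose for their footnote):

```latex
% One-step TD update for the state-value function
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

% SARSA: bootstraps from the action a_{t+1} the agent actually took
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Expected SARSA: bootstraps from the expectation over the policy's actions
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) \, Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The bootstrapped term is the only place where SARSA and Expected SARSA differ, which is also why identical diagrams for the two algorithms are confusing.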
A couple of issues. 1) "It turns out that Monte Carlo is averaging over real trajectories whereas TD learning is averaging over paths." It may be desirable to say something like "over all possible paths". 2) "Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo." In fact, "deep RL" is a relatively new term from the last few years; it would be more accurate to just say "RL". Another point: policy gradient methods are also popular; they could perhaps be mentioned in a footnote.
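To make the "all possible paths" wording concrete, here is a rough sketch of the two targets being contrasted (standard definitions, not quoted from the article):

```latex
% Monte Carlo target: the sampled return along one real trajectory
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

% TD target: bootstraps from V(s_{t+1}), which itself estimates an
% expectation over every path the MDP could take from s_{t+1}
r_{t+1} + \gamma V(s_{t+1}), \qquad
V(s_{t+1}) \approx \mathbb{E}_{\pi}\!\left[ G_{t+1} \mid s_{t+1} \right]
```

Monte Carlo averages the first quantity over the trajectories that were actually sampled, while the TD target implicitly averages over all possible continuations through the learned value estimate.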
Distill employs a reviewer worksheet to help guide reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the meaning of the scale is consistently "higher is better", please read the explanations of our expectations for each score; we do not expect even exceptionally good papers to receive a perfect score in every category, and we expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results
Comments: Starting from the introduction of the Q function, some diagrams appear confusing, e.g., the two diagrams for SARSA and Expected SARSA appear the same.
Comments: There could be more references, or Sutton and Barto alone would suffice. The Sutton & Barto reference should be dated 2018. Reference 3 should be cited as an ICML paper, and it appears not especially relevant to the article.