distillpub / post--td-paths

The Paths Perspective on Value Learning
https://distill.pub/2019/paths-perspective-on-value-learning/
Creative Commons Attribution 4.0 International

Review #3 #6

Open awni opened 5 years ago

awni commented 5 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to review this article.


General Comments

The article presents an illustrative way to understand various temporal difference learning algorithms. In general, it is rather helpful.

Several concerns. 1) The “updates toward” operator is not precise in some sense, especially for people who know the exact formulas. It may be helpful to put such formulas in a footnote. 2) The diagrams are sometimes confusing, e.g., the two diagrams for SARSA and Expected SARSA are the same. 3) It would be helpful to explain the design of the diagrams for the value function, Q value function, and policy/action section, such as the meaning of the different colors and color patterns, as in the illustration of the five algorithms. (This is related to the previous point.)
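For reference, the exact formulas that the first two concerns allude to might look like the following. This is only a sketch in standard textbook notation, not taken from the article under review: the step size \alpha, the discount \gamma, the policy \pi, and the convention that r_t is the reward received on leaving s_t are all assumed symbols.

```latex
% "V(s_t) updates toward a target" read as a step of size \alpha toward that target
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]

% SARSA: the target uses the action a_{t+1} that was actually taken
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Expected SARSA: the target averages over the policy's action probabilities
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \sum_{a} \pi(a \mid s_{t+1}) \, Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

Written this way, SARSA and Expected SARSA differ only in the bootstrapped term of the target, which is the distinction a pair of otherwise identical diagrams would need to convey.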

A couple of issues. 1) "It turns out that Monte Carlo is averaging over real trajectories whereas TD learning is averaging over paths." It may be preferable to say something like "over all possible paths". 2) "Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo." In fact, "deep RL" is a relatively new term from the last few years, so it would be more accurate to just say "RL". In addition, policy gradient methods are also popular; perhaps mention them in a footnote.
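The trajectories-versus-paths distinction in the first issue can be illustrated with a small sketch. The two episodes, the state names, and the function names below are hypothetical and chosen only for illustration; rewards are undiscounted, and the TD estimate uses a constant step size, so it is approximate.

```python
from collections import defaultdict

# Two hypothetical episodes that cross at state "C". Each step is a
# (state, reward received on leaving that state) pair; no discounting.
episodes = [
    [("A", 0.0), ("C", 1.0)],
    [("B", 0.0), ("C", 0.0)],
]


def monte_carlo_values(episodes):
    """Average the observed return that followed each state visit."""
    returns = defaultdict(list)
    for episode in episodes:
        rewards = [reward for _, reward in episode]
        for i, (state, _) in enumerate(episode):
            returns[state].append(sum(rewards[i:]))
    return {state: sum(g) / len(g) for state, g in returns.items()}


def td0_values(episodes, sweeps=5000, alpha=0.01):
    """Replay the same episodes many times, applying TD(0) updates."""
    values = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for i, (state, reward) in enumerate(episode):
                # Bootstrap from the next state's estimate; 0 after the last step.
                if i + 1 < len(episode):
                    next_value = values[episode[i + 1][0]]
                else:
                    next_value = 0.0
                values[state] += alpha * (reward + next_value - values[state])
    return dict(values)


if __name__ == "__main__":
    print("Monte Carlo:", monte_carlo_values(episodes))
    print("TD(0):      ", td0_values(episodes))
```

On this data, Monte Carlo assigns A the return of the one trajectory that actually started there, while TD(0) gives A and B roughly the same value as C, because the paths A→C→1 and A→C→0 are both possible once the two episodes are merged at C.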


Distill provides a reviewer worksheet as an aid for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 4/5
Outstanding Communication Score
Article Structure 4/5
Writing Style 4/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 4/5

Comments: Starting from the introduction of Q function, some diagrams appear confusing, e.g., the two diagrams for SARSA and Expected SARSA appear the same.

Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 3/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5

Comments: More references could be added, although Sutton and Barto may suffice. The Sutton & Barto reference should be dated 2018. Reference 3 should be listed as appearing in ICML, and it does not seem very relevant to the article.

greydanus commented 5 years ago

Hello! First of all, thanks for reading our article and sending us some thoughtful suggestions. In response to your review, we’ve made several changes. Here’s an overall summary (along with some general discussion of our thought process):

“...The “updates toward” operator is not precise in some sense...put such formulas in the footnote”

“...explain the design of diagrams for value function,...”

"averaging over paths” vs “over all possible paths”

“Deep RL” vs “RL”