awni opened this issue 5 years ago
Hello! First of all, thanks for reading our article and sending us some thoughtful suggestions. In response to your review, we’ve made several changes. Here’s an overall summary, along with some general discussion of our thought process:
- “...The ‘updates toward’ operator is not precise in some sense... put such formulas in the footnote”
- “...explain the design of diagrams for value function, ...”
- “averaging over paths” vs. “over all possible paths”
- “Deep RL” vs. “RL”
The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to review this article.
General Comments
The article presents an illustrative way to understand various temporal difference learning algorithms. In general, it is rather helpful.
Several concerns. 1) The “updates toward” operator is not precise in some sense, especially for people who know the exact formulas. It may be helpful to put such formulas in a footnote. 2) The diagrams are sometimes confusing; e.g., the two diagrams for SARSA and Expected SARSA are the same. 3) It would be helpful to explain the design of the diagrams for the value function, Q value function, and policy/action selection, such as the meaning of the different colors and color patterns, as in the illustration of the five algorithms. (This is related to the last point.)
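For readers following along, here is a minimal sketch of the exact one-step updates that the “updates toward” arrows stand in for, written in standard Sutton & Barto notation (α is the step size and γ the discount factor; this is the textbook form, not necessarily the exact form the authors chose for their footnote):

```latex
% One-step TD update for the state-value function
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

% SARSA: bootstraps from the action a_{t+1} the agent actually took
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Expected SARSA: bootstraps from the expectation over the policy's actions
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) \, Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The bootstrapped term is the only place where SARSA and Expected SARSA differ, which is also why identical diagrams for the two algorithms are confusing.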
A couple of issues. 1) "It turns out that Monte Carlo is averaging over real trajectories whereas TD learning is averaging over paths." It may be desirable to say something like "over all possible paths". 2) "Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo." In fact, "deep RL" is a relatively new term from the last few years; it would be more accurate to just say "RL". Another point: policy gradient methods are also popular; they could perhaps be mentioned in a footnote.
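To make the "all possible paths" wording concrete, here is a rough sketch of the two targets being contrasted (standard definitions, not quoted from the article):

```latex
% Monte Carlo target: the sampled return along one real trajectory
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

% TD target: bootstraps from V(s_{t+1}), which itself estimates an
% expectation over every path the MDP could take from s_{t+1}
r_{t+1} + \gamma V(s_{t+1}), \qquad
V(s_{t+1}) \approx \mathbb{E}_{\pi}\!\left[ G_{t+1} \mid s_{t+1} \right]
```

Monte Carlo averages the first quantity over the trajectories that were actually sampled, while the TD target implicitly averages over all possible continuations through the learned value estimate.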
Distill employs a reviewer worksheet to help guide reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the meaning of the scale is consistently "higher is better", please read the explanations of our expectations for each score; we do not expect even exceptionally good papers to receive a perfect score in every category, and we expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results
Comments: Starting from the introduction of the Q function, some diagrams appear confusing, e.g., the two diagrams for SARSA and Expected SARSA appear the same.
Comments: There could be more references, or Sutton and Barto alone would suffice. The Sutton & Barto reference should be dated 2018. Reference 3 should be cited as an ICML paper, and it appears not especially relevant to the article.