distillpub / post--td-paths

The Paths Perspective on Value Learning
https://distill.pub/2019/paths-perspective-on-value-learning/
Creative Commons Attribution 4.0 International

Review #2 #5


awni commented 5 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Erich Elsen for taking the time to review this article.


General Comments

The article mentions Go as a canonical example of RL success, but AlphaGo relies on Monte Carlo Tree Search, while the article makes it seem like Monte Carlo is hardly ever the right thing to use. These aren't exactly the same thing, but the distinction might be lost on readers who lock onto the "Monte Carlo" bit. Explaining all of AlphaGo is clearly too much for this article, but something that clarifies the point for the lay reader who notices "Monte Carlo" in both places and wonders what is going on might be helpful.

I find Example 6.4 of the 1st edition of Sutton & Barto to also provide useful intuition on the differences between Monte Carlo and TD.
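
For readers who want that contrast spelled out, here is a minimal tabular sketch (not the article's code; the states, rewards, and hyperparameters are made up) of the two update rules applied to a single episode:

```python
# Toy sketch contrasting TD(0) and Monte Carlo value updates on one episode.
gamma, alpha = 1.0, 0.1
states = ["A", "B", "C", "goal"]
V_td = {s: 0.0 for s in states}
V_mc = {s: 0.0 for s in states}

# One episode: (state, reward, next_state) transitions.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "goal")]

# TD(0): each state is updated from the next reward plus the *current
# estimate* of the next state, so estimates bootstrap off one another.
for s, r, s_next in episode:
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# Monte Carlo: wait until the episode ends, then move each state's
# estimate toward the full observed return from that state onward.
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])
```

The difference in where the information comes from (a bootstrapped neighbor estimate versus the full sampled return) is exactly what the Sutton & Barto example illustrates.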

Minor nit: Falling off the cliff usually has a much more negative reward, -10?

I would find it more natural to write the expansion of the value function in terms of the reward at (t+1) and the discounted V at (t+2) instead of at t and (t+1). I understand the motivation was to allow it to be plugged directly into the next equation, but it requires generalization on the next line in any case.
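
For reference, a sketch of the one-step expansion and its unrolled form under the Sutton & Barto indexing convention, where $R_{t+1}$ denotes the reward received after leaving $S_t$ (this may not match the article's notation exactly):

$$V(s_t) \;=\; \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s_t\right] \;=\; \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \mid S_t = s_t\right]$$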

The diagram where you can change the reward of the “blue” square is a bit confusing. There are really two shades of blue in the diagram at the default settings (although one is “more” blue than the other; English lets us down here, Russian wouldn’t have this problem). The second confusing bit is that the default value on the slider is 0.5, but the value shown in the diagram is always 2. These two things combined left me slightly confused at first, until I realized I was changing the value of going to the “blue” circle.

The section on double Q-learning makes me think that learning distributions rather than just means would provide more information, since one could then estimate the variance of paths in addition to their expectations. I feel like this must already have been done. It is probably worth a mention, since if I thought of it immediately, others likely will as well.
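
For reference, a minimal sketch of the tabular double Q-learning update (van Hasselt, 2010); the action set, state representation, and hyperparameters here are illustrative, not taken from the article:

```python
import random
from collections import defaultdict

# Illustrative hyperparameters and action set (not from the article).
alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["up", "down", "left", "right"]
Q1 = defaultdict(float)  # keyed by (state, action)
Q2 = defaultdict(float)

def epsilon_greedy(state):
    # Act on the combined estimate; explore with probability epsilon.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])

def double_q_update(state, action, reward, next_state):
    # Randomly pick which table to update; the *other* table evaluates
    # the greedy action, decoupling action selection from evaluation
    # and reducing the overestimation bias of vanilla Q-learning.
    if random.random() < 0.5:
        a_star = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, a_star)]
        Q1[(state, action)] += alpha * (target - Q1[(state, action)])
    else:
        a_star = max(actions, key=lambda a: Q2[(next_state, a)])
        target = reward + gamma * Q1[(next_state, a_star)]
        Q2[(state, action)] += alpha * (target - Q2[(state, action)])
```

The key design point is that the table being updated never evaluates its own greedy action, which is what keeps the two estimates from reinforcing each other's optimistic errors.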

It would be nice if there were a way to speed up the gridworld playground at the end. Agents can wander around for quite some time before hitting the goal or the cliff, and collecting enough experience to converge is a bit tedious. Maybe just add a button that says “Add 5 agents”. Or, to give an idea of how efficient the different algorithms are, show the amount of experience collected so far. And allow the experience to be reset?


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue
How significant are these contributions? 4/5

Outstanding Communication
Article Structure 5/5
Writing Style 5/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 4/5
Readability 5/5

Scientific Correctness & Integrity
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 5/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5

greydanus commented 5 years ago

Hi Erich. First of all, thanks for reading our article and sending us some well-thought-out comments and suggestions. It’s great to get some outside perspective on what parts of the article need work/re-wording. In response to your review, we’ve made several changes and also addressed some of your concerns point-by-point:

“...Go as canonical example of RL success, but it uses Monte Carlo Tree Search…”

“...write the expansion of the Value function into the reward at (t+1) and the discounted V at (t+2) instead at t and t+1.”

“...The diagram where you can change the reward of the “blue” square is a bit confusing”

“...learning distributions rather than just means would provide more information”

Thanks again for reviewing our article. We hope these responses and changes address your concerns and improve the article!