distillpub / post--td-paths

The Paths Perspective on Value Learning
https://distill.pub/2019/paths-perspective-on-value-learning/
Creative Commons Attribution 4.0 International

Review #1 (#4)


awni commented 5 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Gabriel Synnaeve for taking the time to review this article.


General Comments

Before all this (hopefully constructive) criticism, let me say that the article presentation is awesome.

I think the curly left arrow (↩) may be misleading and would be better replaced with a "+=" sign, or with a note that it denotes a delta/a gradient/an update. Throughout the article it consistently means "updated with", but a left arrow often means "assign".
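For concreteness, here are the two readings side by side, written as the standard one-step TD(0) update (α is the step size, γ the discount factor); this is textbook notation, not necessarily the article's exact formula:

```latex
% Standard tabular TD(0) update with the "assign" reading of the arrow:
\[ V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_t + \gamma\, V(s_{t+1}) - V(s_t) \bigr] \]
% and with the "+=" (updated-with) reading suggested above:
\[ V(s_t) \mathrel{+}= \alpha \bigl[ r_t + \gamma\, V(s_{t+1}) - V(s_t) \bigr] \]
```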

There is a somewhat inconsistent (not incorrect!) treatment/presentation of discount factors.

"A pleasant correspondence has emerged." -> there is an outer expectation missing in your explannation, to really make clear that in your example you average over both trajectories and that how $V{MC}(s=(2,3)) = 1/(N=2) sum{i=0}^{N-1} r_i$ (coordinates: x left to right, y bottom to top). I get that it may complicate the message too much at this point, maybe this averaging can be shown under the first cliff world figure of the introduction.

"Over the last few decades, most work in deep RL has preferred TD learning to Monte Carlo. Indeed, most approaches to deep RL use TD-style value updates. That said, a lot of work goes into making these methods more stable." -> I would not be that unequivocal: AlphaGo's and AG Zero's value models are trained on the game outcome (yes they're forwarded/n-stepped in time through the tree search, but it's a Monte Carlo). There have been enough successes of (pure) policy-based training and even zero order methods, that I don't think value-based TD can have a claim on "most approaches to deep RL".

In the final Gridworld Playground, I got a "maximum exploit" (slider all the way to the left) Monte Carlo agent to get stuck (e.g. in cell (0, 2) trying to go left, or in a loop between (0, 0) and (0, 1)). It seemed like there was no randomness when selecting the max if two or more actions share the maximum value, but I checked your code and Playground.js:greedy_as(s) looks correct. After debugging, it seems to be because the Q values are not all initialized to the same value (e.g. 0), so one of them is randomly the max and greedy_as(s) always ends up with a single element in the optimal_as array (see the sketch below). A slider for gamma here would also make sense. BTW, you could link to the JavaScript code under the figures or at the end of the article; most of it is clear enough.
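For reference, a minimal sketch of the behavior being described: zero-initialized Q values plus random tie-breaking in the greedy selection. This is illustrative code only, not the article's actual Playground.js implementation; the names loosely mirror those mentioned above.

```javascript
// Illustrative sketch: greedy action selection with random tie-breaking.
// If all Q values start at the same value (e.g. 0), optimal_as contains
// every action at first and the agent can still move around; if they start
// at distinct random values, optimal_as has a single element and a
// "maximum exploit" agent can get stuck.

const ACTIONS = ['up', 'down', 'left', 'right'];

// Initialize Q(s, a) = 0 for every action so that ties are possible.
function init_q() {
  const q = {};
  for (const a of ACTIONS) q[a] = 0;
  return q;
}

// Return a greedy action for one state's Q values, breaking ties uniformly.
function greedy_a(q_s) {
  const q_max = Math.max(...ACTIONS.map(a => q_s[a]));
  const optimal_as = ACTIONS.filter(a => q_s[a] === q_max);
  return optimal_as[Math.floor(Math.random() * optimal_as.length)];
}

// With zero initialization, every action is a candidate at the start.
console.log(greedy_a(init_q()));  // random choice among all four actions
```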

From the current content and presentation, it would be very easy to extend this into a follow-up on TD(λ).

Overall the article is very clear and explains powerful concepts of reinforcement learning.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue | Score
--- | ---
How significant are these contributions? | 4/5

Outstanding Communication | Score
--- | ---
Article Structure | 3/5
Writing Style | 4/5
Diagram & Interface Style | 4/5
Impact of diagrams / interfaces / tools for thought? | 5/5
Readability | 4/5

Scientific Correctness & Integrity | Score
--- | ---
Are claims in the article well supported? | 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? | 4/5
How easy would it be to replicate (or falsify) the results? | 5/5
Does the article cite relevant work? | 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? | 4/5
greydanus commented 4 years ago

Hello Gabriel, thanks for taking the time to review our article and offer some suggestions. We believe they are really going to help the article! Below you will find detailed responses to some of the issues you raised.

“...curly left arrow (↩) may be misleading...”

“...somewhat inconsistent...presentation of discount factors…” and “...a sentence explaining early on why we would wish to discount future rewards…” and “...label and point to gamma as the discount factor?...”

“...there is an outer expectation missing in your explanation…”

“...I don't think value-based TD can have a claim on ‘most approaches to deep RL’”

“...Gridworld Playground, I got a ‘maximum exploit’”

“...a follow-up on TD(lambda)”

Thanks again for reviewing our article. We hope these responses and changes address your concerns and improve the article.