distillpub / post--understanding-rl-vision

Understanding RL vision Distill article
https://distill.pub/2020/understanding-rl-vision/
Creative Commons Attribution 4.0 International

Review #1 (#6)

distillpub-reviewers commented 4 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.


General Comments

This paper combines several techniques, including dimensionality reduction and feature visualization, to inspect the visual features learned by a reinforcement learning (RL) model. The main hypothesis made in the paper is the diversity hypothesis: if RL models are trained on more diverse environments, they become more interpretable. To support this hypothesis, the paper uses the procedurally-generated video game environment CoinRun as its research platform and applies feature visualization techniques to identify visually interpretable focal points of the model. The interfaces provided in the paper contain a rich collection of visual examples of different games trained under different experimental settings, which support the hypothesis well and help readers understand the claim.
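For readers unfamiliar with the dimensionality-reduction step, below is a minimal sketch of the general idea, using non-negative matrix factorization (NMF) on a hidden layer's activations. The shapes, the use of raw activations rather than attributions, and the scikit-learn implementation are illustrative assumptions, not the article's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import NMF

# Illustrative only: activations from some convolutional layer of the policy
# network, collected over a batch of observations.
# Shape: (batch, height, width, channels) -- e.g. 128 frames, 16x16 spatial, 64 channels.
acts = np.random.rand(128, 16, 16, 64).astype(np.float32)

# Flatten the batch and spatial dimensions so that each row is one spatial position.
flat = acts.reshape(-1, acts.shape[-1])          # (128*16*16, 64)

# Factorize into a small number of non-negative "features" (directions in
# channel space), so each position is described by a few components
# instead of 64 raw channels.
nmf = NMF(n_components=8, init="nndsvd", max_iter=500, random_state=0)
weights = nmf.fit_transform(flat)                # (128*16*16, 8) per-position strengths
features = nmf.components_                       # (8, 64) channel-space directions

# Reshape the per-position strengths back into spatial maps, one per component.
maps = weights.reshape(128, 16, 16, 8)
```

In the article, a similar factorization is reportedly applied to attributions (see point 9 below) to obtain the small set of features that the interfaces visualize.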

This paper is well-written and provides many visual examples to help explain the ideas. Below are some suggestions on improving the writing:

  1. It would be more intuitive for readers who are not familiar with CoinRun if they could play the game themselves via keyboard controls. I suggest adding an interactive interface at the beginning of the paper. This would help readers understand how easy or difficult it is to play CoinRun and how good the controls are.

  2. For footnote 2, it’s worth pointing out that the darkness of the block represents the magnitude of the velocity.

  3. Can you include some failed rollouts in the “Model Analysis” section to show that the value function can tell that the agent will die before the episode ends?

  4. It’s not straightforward what the “attribution channel totals” mean or what the y-axis represents. A better caption or explanation is needed.

  5. It would be easier to see the action choice if the next action were shown directly on the policy probabilities. For example, using a different color for the next action in the probability distribution would be more straightforward.

  6. As mentioned in the “Dissecting failure” section, the failure is due to the lack of memory and stochastic sampling. Have you tried running the same analysis while the agent always picks the action with the highest probability (a greedy policy) at test time? And have you tried an RNN policy? Do these help reduce failure cases? (See the sketch after this list for the two action-selection schemes being contrasted.)

  7. It would help readers (especially those unfamiliar with the CoinRun environment) see clearly what is in the scene if a full-resolution observation were added next to the compressed observation.

  8. The “Landing platform moving off-screen” example is not very convincing. Even in the first few frames, the platform on the right is fully visible in the view. The agent fails here because it jumps too early, and once the agent is in the air, its trajectory is barely affected in CoinRun no matter what action it takes. If the agent had moved one more step to the right before jumping, it would have succeeded. So it doesn’t seem that the agent fails here because the platform moves out of view.

  9. Footnote 8 mentions that the features are obtained by applying attribution-based NMF to layer 2b of the model. But the paper doesn’t provide a detailed description of the network architecture, so it’s not clear which layer was used for the analysis.
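Regarding point 6 above, here is a minimal sketch of the two action-selection schemes being contrasted. The action count, probability values, and function names are illustrative, not taken from the article's codebase.

```python
import numpy as np

def sample_action(policy_probs, rng=None):
    """Stochastic sampling: the behaviour that point 6 suggests may contribute to failures."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(policy_probs), p=policy_probs))

def greedy_action(policy_probs):
    """Greedy alternative proposed in point 6: always take the most likely action."""
    return int(np.argmax(policy_probs))

# Illustrative policy distribution over a small discrete action set.
probs = np.array([0.05, 0.55, 0.30, 0.10])
print(sample_action(probs))  # occasionally picks a low-probability action
print(greedy_action(probs))  # always picks action 1 here
```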

Overall, I think this paper provides a useful method and a worked example for understanding the visual features learned by an RL model and their interpretability. It’s a valuable contribution to the RL community.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 4/5
Outstanding Communication Score
Article Structure 4/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 5/5
Readability 4/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 5/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5
jacobhilton commented 4 years ago

We are extremely grateful to the reviewer for their thoughtful comments. We have made a number of changes thanks to their suggestions.

  1. We would love to add this, but we are unfortunately not aware of a straightforward way to port the game from Python to the browser. We have therefore added a link in the first footnote to instructions for playing the game using Python.
  2. Fixed.
  3. We agree that this would be helpful, but we wanted to show a trajectory that wasn't cherry-picked here.
  4. Fixed.
  5. Fixed.
  6. These questions are interesting and well-motivated. We did not study them, but would love to do so in future research.
  7. We agree, but we thought it would be good to display the downsampled observations to make it clear what the model is seeing. Ideally we would make it possible to toggle between full-resolution observations and downsampled observations, but this would require significant further effort.
  8. We can see why this example is less convincing. However, we do believe that our interpretation is the correct one. This is because the agent is extremely well-calibrated at timing its jumps, having done so millions of times in training. If it had jumped at the very edge of the platform, it would have been much more likely to land on the lower platform on the far side, where there could have been enemies. Moreover, the agent misses the jump by the tiniest of margins, so that the unlucky action sampling is just enough to cause the agent to fail to make the jump. To make this example more convincing, we have modified the explanation of Timestep 2 to make it clear why the agent seems to be jumping early.
  9. The network structure was previously given in a footnote, which was hard to notice. We have converted this footnote into an appendix and referenced it more prominently, as well as linking to it whenever layer 2b is mentioned.