Review #2 - Githubissues

The following peer review was solicited as part of the Distill review process. The review was formatted by the editor to help with readability.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer, Guillaume Alain, for taking the time to write such a thorough review.

The paper titled “The Building Blocks of Interpretability” demonstrates a number of ways in which we can visualize the role of elements of an image, and then suggests a unifying view to make life easier for the researcher who has to interpret all of this.

For the first half of the paper, we are given a lot of numbers and pictures. While these are interesting by themselves (the pretty colours sure help in that regard), it’s hard to know what to actually do with those elements. I’m a bit torn because I expect Distill papers to be short and on point (which isn’t the case here), but on the flip side everything else is definitely right for a Distill paper (lots of visualization, good exposition with minimal math). The authors go on to say

We will present interfaces that show what the network detects and explain how it develops its understanding, while keeping the amount of information human-scale.

which leads to the question: Was the goal of the first half of the paper just to show how hard it is to use those visualizations? (establishing the problem to propose the solution afterwards)

I’m particularly fond of the second picture in “What Does the Network See”. By the way, is there a reason why the figures are not numbered in any way? It makes them harder to reference.

The concept of “neuron groups” is particularly nice. There is value in recognizing that this whole visualization business is not about single pixels or connected regions. When I read the paper, though, I thought that the authors were going to combine features from different layers. They leave that as a possibility, but maybe it’s better to avoid it in order to simplify the visualization.

The authors suggest a certain formal grammar to help us find our way through all those visualization techniques. This is a good idea. I’m not sure that they’ve put their finger on the correct form, though, but I don’t have a concrete suggestion to improve it.

They use nodes and arrows in a table. One would imagine a priori that arrows would be operators and nodes would be objects. But when it comes to “attribution” the nodes are labeled “T”, which is basically just a dummy letter to provide a dummy node on which they can attach the attribution arrow. What they really wanted to link were the boxes in the table, but that would be clumsy. Then we have dotted arrows used for filters, which is nice. But we also have nodes that are stuck on other nodes (like the blue “I” on the green “T” from the “Filter by output attribution”). I’m guessing that being stuck on a node basically corresponds to a “full arrow” that is not being filtered. My point is that this feels more like an informal sketch than a formal way of describing things (which might end up being too limited because new ideas are hard to fit within a constrained framework). It’s worth including in the paper, but I’m not sure it’s as mature as the other ideas in there.

Throughout the reading of this paper, I wondered about how adversarial examples played into all this. I imagine that adversarial examples are not sufficiently well-understood and they would be more of a distraction than anything here. The authors reference the importance of good interfaces when faced with adversarial examples, but they don't really voice an opinion about whether they are an efficient first line of defence against them, or if they are easily fooled.

typo: adverserial -> adversarial

Thanks for the thoughtful review! We've responded inline below.

Was the goal of the first half of the paper just to show how hard it is to use those visualizations?

In the initial sections, we're building up ideas like "semantic dictionaries" and "attribution to hidden units" and interactive interfaces for reifying them. Neuron groups need all those ideas. We also don't see neuron groups as completely superseding the other interfaces. They're probably the most powerful single interface for getting an overview, but if you want to drill deeper it can be helpful to work with other atoms.

I’m particularly fond of the second picture in “What Does the Network See”.

Glad you liked it!

By the way, is there a reason why the figures are not numbered in any way? It makes them harder to reference.

This is a broader issue with the Distill template, tracked as distillpub/template#63.

The authors suggest a certain formal grammar to help us find our way through all those visualization techniques. This is a good idea. I’m not sure that they’ve put their finger on the correct form, though, but I don’t have a concrete suggestion to improve it.

As of 3a6af47eba0d62e, we clarify that this is an initial exploration of how one might formalize interpretability interfaces as a grammar. Expressing design spaces as formal grammars is an existing technique used in the HCI and data visualization literature, and we think it’s a powerful way of thinking that we wanted to introduce here.

Throughout the reading of this paper, I wondered about how adversarial examples played into all this.

This would be an exciting direction for future work!

We discuss the relationship briefly when we observe "Of course, just like our models can be fooled, the features that make them up can be too — including with adversarial examples [40]." We think there's a lot of exploration to be done in this direction.

adverserial -> adversarial

Fixed as of 36dbdabbeb9.

distillpub / post--building-blocks

Review #2 #11