Review #2 - Githubissues

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

General Comments

Paper summary:

Several feature attribution methods rely on an additional input (besides the one being explained) called the “baseline”. The paper discusses how the choice of baseline impacts the attributions for an input, and proposes the idea of averaging over several baselines when good individual choices do not exist. It does this in the context of the specific attribution method called “Integrated Gradients” and the specific task of object recognition on the ImageNet dataset.

Pros:

The paper is very well-written and easy to follow. It offers a very nice exposition of the Integrated Gradients method. The interactive visualizations immensely help with understanding the various ideas.
The paper tackles the important and thorny issue of picking baselines in feature attribution methods. The visualization that allows choosing different segments of the input image as a baseline is very clever. It makes the sensitivity of the attributions to the choice of baselines very apparent.

Cons:

The paper views the baseline as a mere implementation detail of Integrated Gradients (and other feature attribution methods). This is a bit misleading. The Integrated Gradients paper considers the baseline to be a part of the attribution problem statement. The various axioms are also defined for the pair of input and baseline. In that sense, Integrated Gradients posits that one must commit to a baseline while formulating the attribution problem.
It would help to have more discussion on properties of Expected Gradients (and more generally of the idea of “averaging over baselines”). It is also not clear if one must simply average the attributions across different baselines. Instead, one may study the distribution over attributions to identify different patterns, say via clustering. (See the next section for more suggestions.)

Suggestions:

Below are some suggestions on improving / extending this paper:

The idea of averaging over several baselines seems quite general, and so the paper could be greatly strengthened by including an additional example (preferably for a task on text or tabular inputs).
It would help to discuss which axioms Expected Gradients satisfies. Is there a new completeness axiom to tell us that we have taken enough background samples?
Computing Expected Gradients involves computing the average attribution relative to a random sample of baseline points. The sampling brings uncertainty, and I wonder if the authors considered quantifying the uncertainty with confidence intervals?
An attractive property of the black baseline is that it is encoded as zero, and therefore it is clear how to interpret the sign of the attribution — positive attribution means that the model prefers the pixel to be brighter. If the baseline is non-zero then the sign of the attribution is harder to interpret. A positive attribution would mean that the model prefers for the pixel to move away from the baseline. This may mean making the pixel brighter or darker depending on which side of the baseline the pixel lies. The problem is exacerbated when several different baselines are considered. It would help if the authors comment on interpreting the sign of the attributions.
While the formalism discussed in the paper assumes a certain input distribution D, in practice, we only have a certain sample of the distribution. Often the sample may not be representative. In such cases, I worry that artifacts of the sample may creep into the Expected Gradients. It would help if the authors comment on this.
When considering multiple baselines it could be that the attribution to a pixel is positive for some baselines and negative for some others, and the average attribution ends up being near zero. In such cases, I wonder if the expectation is the right summarization of the distribution of attributions across different baselines. Instead, one could consider clustering the attributions (from different baselines) to separate the different patterns at play.
The idea of averaging gradients across a sample of points is also used by SmoothGrad (https://arxiv.org/abs/1706.03825). Is there a formal connection between Expected Gradients and SmoothGrad?

Minor:

In the second to last figure, what is the value of alpha used for parts (2) and (4)?

Distill employs a reviewer worksheet to help reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue	Score
How significant are these contributions?	4/5

Outstanding Communication	Score
Article Structure	4/5
Writing Style	4/5
Diagram & Interface Style	4/5
Impact of diagrams / interfaces / tools for thought?	4/5
Readability	4/5

Scientific Correctness & Integrity	Score
Are claims in the article well supported?	3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them?	2/5
How easy would it be to replicate (or falsify) the results?	4/5
Does the article cite relevant work?	3/5
Does the article exhibit strong intellectual honesty and scientific hygiene?	3/5

Thank you for the detailed comments! Based on your feedback, we’ve made some changes to the article and added several new sections. In particular:

“The paper views the baseline as a mere implementation detail of Integrated Gradients (and other feature attribution methods). This is a bit misleading. The Integrated Gradients paper considers the baseline to be a part of the attribution problem statement. The various axioms are also defined for the pair of input and baseline. In that sense, Integrated Gradients posits that one must commit to a baseline while formulating the attribution problem.”

This is an important point. In our first version of the article, I think we presented some issues regarding integrated gradients in a manner that made them seem like flaws in the original method, rather than design choices. Our most recent writing attempts to address this by presenting a more nuanced picture of each baseline choice, and especially by shifting the discussion of problems to be about the baseline choice rather than the integrated gradients method itself. I am open to even more suggestions about how to improve in this direction.

“It would help to have more discussion on properties of Expected Gradients (and more generally of the idea of “averaging over baselines”). It is also not clear if one must simply average the attributions across different baselines. Instead, one may study the distribution over attributions to identify different patterns, say via clustering. (See the next section for more suggestions.)”

“It would help to discuss which axioms Expected Gradients satisfies. Is there a new completeness axiom to tell us that we have taken enough background samples?”

We considered for a long time adding an expanded discussion of the axioms that integrated gradients satisfies, and ended up omitting it from our most recent draft. We feel that an extended discussion of those axioms would detract from the main point of the article, which was intended to be focused on the idea of missingness. With that said, we added a footnote about how all of the various baselines we present, including those that are distributions, satisfy the same axioms integrated gradients does.

The idea of seeing which baselines generate which types of patterns in the attributions is a really interesting open question, and one we are particularly interested in thinking about. We leave it to future work :)

“While the formalism discussed in the paper assumes a certain input distribution D, in practice, we only have a certain sample of the distribution. Often the sample may not be representative. In such cases, I worry that artifacts of the sample may creep into the Expected Gradients. It would help if the authors comment on this.”

“When considering multiple baselines it could be that the attribution to a pixel is positive for some baselines and negative for some others, and the average attribution ends up being near zero. In such cases, I wonder if the expectation is the right summarization of the distribution of attributions across different baselines. Instead, one could consider clustering the attributions (from different baselines) to separate the different patterns at play.”

Related to the point above: I think there are many ways to expand the discussion around path methods and many ways to improve the method. Again, for the sake of trying to limit the scope of the article, we will leave them to future work! I do think that the questions you raise are very compelling.

“The idea of averaging over several baselines seems quite general, and so the paper could be greatly strengthened by including an additional example (preferably for a task on text or tabular inputs).”

For the sake of scope, we don’t include additional data types in this article, especially since they would require significant additional work to visualize. I do agree with you though: the idea of averaging over multiple baselines should be fairly general, and I would hope to see future work in this direction.

“Computing Expected Gradients involves computing the average attribution relative to a random sample of baseline points. The sampling brings uncertainty, and I wonder if the authors considered quantifying the uncertainty with confidence intervals?”

This is another really good point that we don’t directly address in the article. I fear that doing so would open up a large can of worms about whether or not you can trust attributions that are generated by a stochastic process (I believe you can). However, I am interested in this question as well and hope to pursue it in the future.
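As a rough illustration of the confidence-interval suggestion quoted above, the sketch below estimates Expected Gradients by Monte Carlo sampling and attaches a normal-approximation interval to each feature. Everything here is an assumption made for illustration: the toy model `f`, its gradient `grad_f`, the helper name `expected_gradients_with_ci`, and the choice of a standard-error-based interval. None of it comes from the article or the review.

```python
import math
import random
import statistics

def f(x):
    # Hypothetical toy "model": a smooth scalar function of a 2-feature input.
    return math.tanh(x[0] + 2.0 * x[1])

def grad_f(x):
    # Analytic gradient of the toy model above.
    s = 1.0 - math.tanh(x[0] + 2.0 * x[1]) ** 2
    return [s, 2.0 * s]

def expected_gradients_with_ci(x, baselines, n_samples=2000, z=1.96, seed=0):
    """Monte Carlo estimate of Expected Gradients with a per-feature
    normal-approximation confidence interval (an assumed choice)."""
    rng = random.Random(seed)
    samples = [[] for _ in x]
    for _ in range(n_samples):
        b = rng.choice(baselines)   # baseline drawn from the background sample
        a = rng.random()            # interpolation coefficient alpha ~ U(0, 1)
        point = [bi + a * (xi - bi) for xi, bi in zip(x, b)]
        g = grad_f(point)
        for i, (xi, bi, gi) in enumerate(zip(x, b, g)):
            samples[i].append((xi - bi) * gi)
    result = []
    for s in samples:
        mean = statistics.fmean(s)
        half_width = z * statistics.stdev(s) / math.sqrt(len(s))
        result.append((mean, mean - half_width, mean + half_width))
    return result

x = [0.5, 0.25]
baselines = [[0.0, 0.0], [1.0, -0.5], [-0.5, 0.5]]
for mean, low, high in expected_gradients_with_ci(x, baselines):
    print(f"attribution {mean:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The interval only captures the sampling noise from drawing baselines and interpolation coefficients. It says nothing about whether the background sample itself is representative, which is the separate concern the reviewer raises.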

“An attractive property of the black baseline is that it is encoded as zero, and therefore it is clear how to interpret the sign of the attribution — positive attribution means that the model prefers the pixel to be brighter. If the baseline is non-zero then the sign of the attribution is harder to interpret. A positive attribution would mean that the model prefers for the pixel to move away from the baseline. This may mean making the pixel brighter or darker depending on which side of the baseline the pixel lies. The problem is exacerbated when several different baselines are considered. It would help if the authors comment on interpreting the sign of the attributions.”

I’ve put a fair bit of thought into this and I can’t quite convince myself that it is true. As long as the baseline has a lower network output than the explained input, doesn’t the sign retain its meaning? That is, as long as f(x) - f(x’) > 0, I think that a positive attribution means an increase in output, because we increase the output as we move along the path from x’ to x. I would have to formalize this intuition and run experiments to be sure.

In general, we somewhat dodge the issue of sign in this article. I know that it’s a large omission, but it just doesn’t fit in with the rest of the article. A discussion about the sign of attributions for path methods is a much needed discussion, but I can’t find a way to elegantly include it here.
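The sign question can be made concrete with a small sketch. It assumes a hypothetical two-pixel linear model f(x) = 2·x0 + 2·x1, whose output always rises with brightness, and uses a midpoint Riemann sum as one possible approximation of the path integral. The function names are illustrative only.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        a = (k + 0.5) / steps
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Attribution = (input - baseline) * path-averaged gradient.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]

def f_linear(x):
    # Hypothetical model: brighter pixels always increase the output.
    return 2.0 * x[0] + 2.0 * x[1]

def grad_linear(point):
    # Gradient of f_linear is constant.
    return [2.0, 2.0]

x = [0.2, 0.9]
black = [0.0, 0.0]
mixed = [0.5, 0.0]   # first pixel of the baseline is brighter than the input

attr_black = integrated_gradients(grad_linear, x, black)
attr_mixed = integrated_gradients(grad_linear, x, mixed)
print(attr_black, attr_mixed)
```

With the black baseline both attributions are positive. With the mixed baseline, f(x) - f(x’) is still positive, yet the first pixel receives a negative attribution even though the model prefers it brighter. In this toy case the sign of each attribution reflects the effect of moving that pixel from its baseline value to its input value, while completeness constrains only the sum of the attributions.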

“The idea of averaging gradients across a sample of points is also used by SmoothGrad (https://arxiv.org/abs/1706.03825). Is there a formal connection between Expected Gradients and SmoothGrad?”

Aha! There is! Based on this feedback, our new section “Expectations, and Connections to SmoothGrad” discusses this in detail.
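One way to see a link, written here as a sketch under standard definitions and not necessarily the derivation used in the article: SmoothGrad averages gradients at noisy copies of the input, and Expected Gradients with a Gaussian baseline x’ = x + ε also averages gradients at noisy copies, but weights them by the noise. The notation (σ for the noise scale, D for the baseline distribution, ⊙ for the elementwise product) is assumed.

```latex
\begin{align*}
\text{SmoothGrad}(x) &= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}
  \left[ \nabla_x f(x + \epsilon) \right] \\
\text{EG}(x) &= \mathbb{E}_{x' \sim D,\ \alpha \sim U(0,1)}
  \left[ (x - x') \odot \nabla_x f\big(x' + \alpha (x - x')\big) \right] \\
\text{with } x' = x + \epsilon: \quad
\text{EG}(x) &= -\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I),\ \alpha \sim U(0,1)}
  \left[ \epsilon \odot \nabla_x f(x + \alpha \epsilon) \right]
\end{align*}
```

The last line follows because x - x’ = -ε and x’ + α(x - x’) = x + (1 - α)ε, and 1 - α is uniform on (0, 1) whenever α is.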

I hope that our new version addresses some of your concerns, especially the concerns regarding the mis-characterization of the original integrated gradients method. I feel this is an important issue and I don’t want to portray integrated gradients in an unnecessarily negative light.

distillpub / post--attribution-baselines