distillpub-reviewers closed this issue 5 years ago
We thank the reviewer for the thorough review and great suggestions! We will address these comments in batches, since there are many things to address and discuss. Edits mentioned in this post were checked in as part of this commit.
In terms of contribution significance, Dang-Ha (2017) has derived an equivalent way of computing receptive fields for single-path CNNs
It is correct that Dang-Ha (2017) derived an algorithm to compute receptive fields for single-path CNNs. But we would like to point out that they do not provide a full mathematical derivation for this case; in contrast, our paper provides a full derivation, with a closed-form expression for the single-path case.
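The closed-form expression for the single-path case mentioned above can be sketched in a few lines. This is a minimal illustration (the function name and interface are ours, and each layer is described only by its kernel size and stride, with no dilation):

```python
def receptive_field(kernel_sizes, strides):
    """Closed-form receptive field size of a single-path CNN:
    r_0 = sum over layers l of (k_l - 1) * prod_{i<l} s_i, plus 1."""
    r = 1      # receptive field of the output feature onto itself
    jump = 1   # cumulative stride: product of strides of earlier layers
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3 stride-1 convolutions yield a 5x5 receptive field:
receptive_field([3, 3], [1, 1])  # 5
```

Striding compounds quickly: three 3x3 convolutions with stride 2 already give `receptive_field([3, 3, 3], [2, 2, 2]) == 15`.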
The topic offers an excellent opportunity to clarify and illustrate what a receptive field is, but the submission is currently lacking a definition, formal or informal. A few definitions have been proposed that could be reused
Great point, very good suggestion. We believe an informal definition is the best choice in this case; we added it to the second paragraph (which was rephrased accordingly).
The motivation for studying receptive fields could be strengthened. I find the following sentence a bit confusing: “For example, for the computer vision application of object detection, it is important to understand a convolutional feature’s spatial span in order to represent objects over multiple scales.” Why is it important to represent objects over multiple scales? In what ways does understanding a convolutional feature’s spatial span help us towards that goal?
Thanks for the feedback. We rephrased it in order to make it more accessible and clarify why receptive field computation is helpful in this case.
there are a few articles online and on arXiv which should be acknowledged
We integrated this into the text. We now discuss Dang-Ha (2017) and Le & Borji (2018) in the third paragraph. We also expanded the last paragraph to mention more details from Luo et al. (2016), as suggested.
The submission could also make a better effort at citing work relevant to feature visualization, theoretical guarantees, model interpretability, and generalization, or at least point to review papers where readers could find a more thorough literature review.
We integrated other recent work in these areas.
Here is a second batch of edits, checked in as part of this commit.
“What is the effect of dilation on the size of receptive fields? Why would we want to use it instead of larger convolutional kernels?”, “Does upsampling reduce the size of the receptive field?”, “If we define the receptive field in terms of the features in the input which affect a given output feature, is the receptive field of a batch normalization layer at training time the whole input?”
We added a paragraph in the main text to mention the case of other operations, and added a new appendix section to discuss them.
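One of the questions above, the effect of dilation, has a compact answer: a kernel of size k with dilation rate d covers the same span as a dense kernel of size d(k−1)+1, so dilation enlarges the receptive field without adding parameters. A small sketch (the helper name is ours, for illustration only):

```python
def effective_kernel_size(k, dilation):
    # A dilated kernel touches k input positions spread over a span of
    # dilation * (k - 1) + 1 positions, with no extra parameters.
    return dilation * (k - 1) + 1

effective_kernel_size(3, 2)  # 5: same spatial span as a dense 5x5 kernel
```

Substituting this effective kernel size into the receptive field equations handles dilated layers with no other changes.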
I’m not sure it’s enough evidence to support assertions like “large receptive fields are necessary for high-level recognition tasks”, “networks which can efficiently generate large receptive fields may enjoy enhanced recognition performance.”, or “the network’s receptive field seems to be a necessary component [for improved performance]”. The text should make it clearer that this is a conjecture. This would also be a great opportunity to discuss what sort of experiments could be performed to verify it.
Yes, these are conjectures and not definitive proofs. We have mainly used the terminology “suggests” to indicate that this is a conjecture, and have also rephrased the third case to make it clearer; all three sentences were rephrased accordingly.
We also added a footnote to that paragraph to discuss experiments that can be done to help shed light on this hypothesis.
Here are the remaining edits, which were done as part of the 6 most recent commits.
Most of the text in the “Problem setup” section could be replaced by visually representing it in a figure.
We added a new figure to support the explanation of these parameters, as suggested.
"As a simple example, consider layer L, which takes features f{L−1} as input, and generates f_L as output. It is easy to see that kL features from f{L−1} can influence one feature from f_L , since each feature from f_L is directly connected to kL features from f{L−1}. So, r_{L−1} = kL.” This paragraph is best understood when visualized and would benefit from an accompanying figure.
We added a new figure to support this explanation, as suggested.
"For the case where kl > 1, we just need to add k{l−1} features, which will cover those from the left and the right of the region. For example, if we use a kernel size of 5 (k_l = 5), there would be 2 extra features used on each side, adding 4 in total.” The textual example here could be replaced with a figure and some embedded description (as is common in Distill articles).
We added a new figure to support this explanation, as suggested.
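The single-layer observation quoted above (r_{L−1} = k_L) is the base case of the general recurrence, which propagates the receptive field backward from the output (r_L = 1) as r_{l−1} = s_l · r_l + (k_l − s_l). A minimal sketch of that recursion (function name ours; padding and dilation omitted):

```python
def receptive_field_recursive(kernel_sizes, strides):
    """Receptive field size via the backward recurrence
    r_{l-1} = s_l * r_l + (k_l - s_l), starting from r_L = 1."""
    r = 1  # one output feature
    for k, s in reversed(list(zip(kernel_sizes, strides))):
        r = s * r + (k - s)
    return r

# Single 5-wide layer: base case r_{L-1} = k_L = 5.
receptive_field_recursive([5], [1])  # 5
```

This matches the closed-form result, e.g. two stacked 3x3 stride-1 convolutions give 5, and the per-layer increments of k_l − 1 (the "2 extra features on each side" in the quoted example) fall out of the (k_l − s_l) term when s_l = 1.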
Since we have addressed all suggestions from the reviewer, I am closing this issue now. But please feel free to reopen if you have any further questions/suggestions/comments :)
The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to write such a thorough review.
On Advancing the Dialogue
In terms of contribution significance, Dang-Ha (2017) has derived an equivalent way of computing receptive fields for single-path CNNs; the novelty in this submission lies in the extension of the derivation to arbitrary computation graphs and the discussion on potential alignment issues that can arise. Most of the benefits of being able to compute the location and size of receptive fields in CNNs in terms of improving the understanding of these models are theoretical at the moment. The submission can increase the significance of its contributions either by introducing a richer set of analyses to showcase what insights can be gained by looking at receptive fields, or by putting more emphasis on the educational aspects of the article.
On Outstanding Communication
I think this is the area where the submission has the most room for improvement.
The submission is well-written, but notation-heavy at times and could rely more heavily on visual support. The ideas discussed require some thinking on the reader’s part, but I feel that there should be a way to reduce friction by replacing some of the math with appropriate figures. More concretely:
The topic offers an excellent opportunity to clarify and illustrate what a receptive field is, but the submission is currently lacking a definition, formal or informal. A few definitions have been proposed that could be reused:
I feel many concepts in the submission would be better explained visually. In general I would recommend giving each figure a single and clearly-defined purpose. A few examples:
"For the case where k_l > 1, we just need to add k_l − 1 features, which will cover those from the left and the right of the region. For example, if we use a kernel size of 5 (k_l = 5), there would be 2 extra features used on each side, adding 4 in total.” The textual example here could be replaced with a figure and some embedded description (as is common in Distill articles).
The motivation for studying receptive fields could be strengthened. I find the following sentence a bit confusing: “For example, for the computer vision application of object detection, it is important to understand a convolutional feature’s spatial span in order to represent objects over multiple scales.” Why is it important to represent objects over multiple scales? In what ways does understanding a convolutional feature’s spatial span help us towards that goal?
On Scientific Correctness & Integrity
Most of the statements made in the submission are in the form of equations for the size and location of receptive fields in CNNs. Following along the derivations in the Appendix, they appear to be correct.
There is not that much relevant work to cite regarding the computation of receptive fields in CNNs, but there are a few articles online and on arXiv which should be acknowledged:
The submission could also make a better effort at citing work relevant to feature visualization, theoretical guarantees, model interpretability, and generalization, or at least point to review papers where readers could find a more thorough literature review.
The effect of certain modern architectural features of CNNs (like transposed, upsampled, dilated, and separable convolutions) on receptive fields is not mentioned in the submission. Covering all possible convolutional layers may be beyond the scope of the submission, but a more thorough discussion would strengthen it. Here are some questions a curious reader might ask:
As the submission itself points out, we should be careful in relating the growth in receptive fields to increased classification accuracy, as there are many possible confounding factors (network depth being an obvious one). The relationship itself is interesting, but I’m not sure it’s enough evidence to support assertions like “large receptive fields are necessary for high-level recognition tasks”, “networks which can efficiently generate large receptive fields may enjoy enhanced recognition performance.”, or “the network’s receptive field seems to be a necessary component [for improved performance]”. The text should make it clearer that this is a conjecture. This would also be a great opportunity to discuss what sort of experiments could be performed to verify it.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: The reviewer asked the editorial team to make this decision. The editorial team determined there was no conflict of interest.
What type of contributions does this article make?: Explanation of existing results