distillpub-reviewers closed this issue 5 years ago
We thank the reviewer for the thorough review and great suggestions! We will address these comments in batches, since there are many things to address and discuss. Edits mentioned in this post were checked in as part of this commit.
In terms of contribution significance, Dang-Ha (2017) has derived an equivalent way of computing receptive fields for single-path CNNs
It is correct that Dang-Ha (2017) derived an algorithm to compute receptive fields for single-path CNNs. But we would like to point out that they do not provide a full mathematical derivation for this case; in contrast, our paper provides a full derivation, with a closed-form expression for the single-path case.
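The closed-form expression for the single-path case mentioned above can be sketched in a few lines. This is a minimal illustration (the function name and interface are ours, and each layer is described only by its kernel size and stride, with no dilation):

```python
def receptive_field(kernel_sizes, strides):
    """Closed-form receptive field size of a single-path CNN:
    r_0 = sum over layers l of (k_l - 1) * prod_{i<l} s_i, plus 1."""
    r = 1      # receptive field of the output feature onto itself
    jump = 1   # cumulative stride: product of strides of earlier layers
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3 stride-1 convolutions yield a 5x5 receptive field:
receptive_field([3, 3], [1, 1])  # 5
```

Striding compounds quickly: three 3x3 convolutions with stride 2 already give `receptive_field([3, 3, 3], [2, 2, 2]) == 15`.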
The topic offers an excellent opportunity to clarify and illustrate what a receptive field is, but the submission is currently lacking a definition, formal or informal. A few definitions have been proposed that could be reused
Great point, very good suggestion. We believe an informal definition is the best choice in this case; we added it to the second paragraph (which was rephrased accordingly).
The motivation for studying receptive fields could be strengthened. I find the following sentence a bit confusing: “For example, for the computer vision application of object detection, it is important to understand a convolutional feature’s spatial span in order to represent objects over multiple scales.” Why is it important to represent objects over multiple scales? In what ways does understanding a convolutional feature’s spatial span help us towards that goal?
Thanks for the feedback. We rephrased it in order to make it more accessible and clarify why receptive field computation is helpful in this case.
there are a few articles online and on arXiv which should be acknowledged
We integrated this into the text. We now discuss Dang-Ha (2017) and Le & Borji (2018) in the third paragraph. We also expanded the last paragraph to mention more details from Luo et al. (2016), as suggested.
The submission could also make a better effort at citing work relevant to feature visualization, theoretical guarantees, model interpretability, and generalization, or at least point to review papers where readers could find a more thorough literature review.
We integrated other recent work in these areas.
Here is a second batch of edits, checked in as part of this commit.
“What is the effect of dilation on the size of receptive fields? Why would we want to use it instead of larger convolutional kernels?”, “Does upsampling reduce the size of the receptive field?”, “If we define the receptive field in terms of the features in the input which affect a given output feature, is the receptive field of a batch normalization layer at training time the whole input?”
We added a paragraph in the main text to mention the case of other operations, and added a new appendix section to discuss them.
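One of the questions above, the effect of dilation, has a compact answer: a kernel of size k with dilation rate d covers the same span as a dense kernel of size d(k−1)+1, so dilation enlarges the receptive field without adding parameters. A small sketch (the helper name is ours, for illustration only):

```python
def effective_kernel_size(k, dilation):
    # A dilated kernel touches k input positions spread over a span of
    # dilation * (k - 1) + 1 positions, with no extra parameters.
    return dilation * (k - 1) + 1

effective_kernel_size(3, 2)  # 5: same spatial span as a dense 5x5 kernel
```

Substituting this effective kernel size into the receptive field equations handles dilated layers with no other changes.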
I’m not sure it’s enough evidence to support assertions like “large receptive fields are necessary for high-level recognition tasks”, “networks which can efficiently generate large receptive fields may enjoy enhanced recognition performance.”, or “the network’s receptive field seems to be a necessary component [for improved performance]”. The text should make it clearer that this is a conjecture. This would also be a great opportunity to discuss what sort of experiments could be performed to verify it.
Yes, these are conjectures and not definitive proofs. We have mainly used the terminology “suggests” to indicate that this is a conjecture, and have also rephrased the third case to make it clearer; all three sentences were rephrased accordingly.
We also added a footnote to that paragraph to discuss experiments that can be done to help shed light on this hypothesis.
Here are the remaining edits, which were done as part of the 6 most recent commits.
Most of the text in the “Problem setup” section could be replaced by visually representing it in a figure.
We added a new figure to support the explanation of these parameters, as suggested.
"As a simple example, consider layer L, which takes features f{L−1} as input, and generates f_L as output. It is easy to see that kL features from f{L−1} can influence one feature from f_L , since each feature from f_L is directly connected to kL features from f{L−1}. So, r_{L−1} = kL.” This paragraph is best understood when visualized and would benefit from an accompanying figure.
We added a new figure to support this explanation, as suggested.
"For the case where kl > 1, we just need to add k{l−1} features, which will cover those from the left and the right of the region. For example, if we use a kernel size of 5 (k_l = 5), there would be 2 extra features used on each side, adding 4 in total.” The textual example here could be replaced with a figure and some embedded description (as is common in Distill articles).
We added a new figure to support this explanation, as suggested.
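The single-layer observation quoted above (r_{L−1} = k_L) is the base case of the general recurrence, which propagates the receptive field backward from the output (r_L = 1) as r_{l−1} = s_l · r_l + (k_l − s_l). A minimal sketch of that recursion (function name ours; padding and dilation omitted):

```python
def receptive_field_recursive(kernel_sizes, strides):
    """Receptive field size via the backward recurrence
    r_{l-1} = s_l * r_l + (k_l - s_l), starting from r_L = 1."""
    r = 1  # one output feature
    for k, s in reversed(list(zip(kernel_sizes, strides))):
        r = s * r + (k - s)
    return r

# Single 5-wide layer: base case r_{L-1} = k_L = 5.
receptive_field_recursive([5], [1])  # 5
```

This matches the closed-form result, e.g. two stacked 3x3 stride-1 convolutions give 5, and the per-layer increments of k_l − 1 (the "2 extra features on each side" in the quoted example) fall out of the (k_l − s_l) term when s_l = 1.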
Since we have addressed all suggestions from the reviewer, I am closing this issue now. But please feel free to reopen if you have any further questions/suggestions/comments :)
The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to write such a thorough review.
On Advancing the Dialogue
In terms of contribution significance, Dang-Ha (2017) has derived an equivalent way of computing receptive fields for single-path CNNs; the novelty in this submission lies in the extension of the derivation to arbitrary computation graphs and the discussion on potential alignment issues that can arise. Most of the benefits of being able to compute the location and size of receptive fields in CNNs in terms of improving the understanding of these models are theoretical at the moment. The submission can increase the significance of its contributions either by introducing a richer set of analyses to showcase what insights can be gained by looking at receptive fields, or by putting more emphasis on the educational aspects of the article.
On Outstanding Communication
I think this is the area where the submission has the most room for improvement.
The submission is well-written, but notation-heavy at times and could rely more heavily on visual support. The ideas discussed require some thinking on the reader’s part, but I feel that there should be a way to reduce friction by replacing some of the math with appropriate figures. More concretely:
The topic offers an excellent opportunity to clarify and illustrate what a receptive field is, but the submission is currently lacking a definition, formal or informal. A few definitions have been proposed that could be reused:
I feel many concepts in the submission would be better explained visually. In general I would recommend giving each figure a single and clearly-defined purpose. A few examples:
"For the case where k_l > 1, we just need to add k_l − 1 features, which will cover those from the left and the right of the region. For example, if we use a kernel size of 5 (k_l = 5), there would be 2 extra features used on each side, adding 4 in total.” The textual example here could be replaced with a figure and some embedded description (as is common in Distill articles).
The motivation for studying receptive fields could be strengthened. I find the following sentence a bit confusing: “For example, for the computer vision application of object detection, it is important to understand a convolutional feature’s spatial span in order to represent objects over multiple scales.” Why is it important to represent objects over multiple scales? In what ways does understanding a convolutional feature’s spatial span help us towards that goal?
On Scientific Correctness & Integrity
Most of the statements made in the submission are in the form of equations for the size and location of receptive fields in CNNs. Following along the derivations in the Appendix, they appear to be correct.
There is not that much relevant work to cite regarding the computation of receptive fields in CNNs, but there are a few articles online and on arXiv which should be acknowledged:
The submission could also make a better effort at citing work relevant to feature visualization, theoretical guarantees, model interpretability, and generalization, or at least point to review papers where readers could find a more thorough literature review.
The effect of certain modern architectural features of CNNs (like transposed, upsampled, dilated, and separable convolutions) on receptive fields is not mentioned in the submission. Covering all possible convolutional layers may be beyond the scope of the submission, but a more thorough discussion would strengthen it. Here are some questions a curious reader might ask:
As the submission itself points out, we should be careful in relating the growth in receptive fields to increased classification accuracy, as there are many possible confounding factors (network depth being an obvious one). The relationship itself is interesting, but I’m not sure it’s enough evidence to support assertions like “large receptive fields are necessary for high-level recognition tasks”, “networks which can efficiently generate large receptive fields may enjoy enhanced recognition performance.”, or “the network’s receptive field seems to be a necessary component [for improved performance]”. The text should make it clearer that this is a conjecture. This would also be a great opportunity to discuss what sort of experiments could be performed to verify it.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: The reviewer asked the editorial team to make this decision. The editorial team determined there was no conflict of interest.
What type of contributions does this article make?: Explanation of existing results