cs231n / cs231n.github.io

Public facing notes page
MIT License

Formula to compute the dimensions of the output volume is unclear #210

Open nbro opened 5 years ago

nbro commented 5 years ago

In this article, you currently say:

We can compute the spatial size of the output volume as a function of the input volume size (\(W\)), the receptive field size of the Conv Layer neurons (\(F\)), the stride with which they are applied (\(S\)), and the amount of zero padding used (\(P\)) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by \((W - F + 2P)/S + 1\). For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output. Lets also see one more graphical example

First of all, the input is a volume, so it doesn't make sense to talk about a single "input volume size" (or "receptive field size", "stride", or "amount of zero padding"), given that the width, height and depth of the input volume may all differ. You may have implicitly assumed that the width, height and depth are all equal. If that's the case, you should have stated it explicitly; otherwise the explanation is, IMHO, unclear and confusing.

Furthermore, you say that the formula is (W - F + 2P)/S + 1, and then you give the example of a 7x7 input (so, I suppose, W = 7), a 3x3 filter (so, I suppose, F = 3), a stride of 1 (so, I suppose, S = 1) and no padding (so, I suppose, P = 0). I should not have needed to suppose anything here. You could have made the reader's life easier by being explicit: why does a 7x7 2D input become W = 7? The connection should be stated explicitly.

Moreover, before the passage I cited above, you talk about input and output "volumes", yet this example uses a 7x7 input, which is not a volume. This adds to the confusion. Either you should have used a "real" volume, or you should have explained why the example uses a 2D 7x7 input. It may be that the depth is not relevant for computing the "output volume size", but if so, you should have stated that clearly.

Finally, you should have emphasized (for the distracted reader) that the formula (W - F + 2P)/S + 1 is actually ((W - F + 2P)/S) + 1.
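As a rough illustration of the explicit reading requested above, the following minimal Python sketch applies the formula separately to each spatial dimension, with the grouping ((W - F + 2P)/S) + 1 written out. The function names `conv_output_size` and `conv_output_shape` are hypothetical. The sketch assumes a square F x F filter, the same stride and padding along both dimensions, and that depth plays no role in the spatial output size. Rejecting settings where S does not evenly divide W - F + 2P is a choice made for this sketch, not something the quoted passage states.

```python
def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output size along ONE spatial dimension: ((W - F + 2P) / S) + 1.

    w: input size along this dimension, f: filter size,
    s: stride, p: zero padding added on each side.
    """
    span = w - f + 2 * p  # number of positions the filter can shift over
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        # Assumption of this sketch: treat a non-integer result as an error.
        raise ValueError("stride does not evenly divide W - F + 2P")
    # The division happens first; the + 1 is added afterwards.
    return (span // s) + 1


def conv_output_shape(height: int, width: int, f: int, s: int = 1, p: int = 0):
    """Apply the 1D formula independently to height and width."""
    return conv_output_size(height, f, s, p), conv_output_size(width, f, s, p)


# The example from the quoted passage: 7x7 input, 3x3 filter, pad 0.
print(conv_output_shape(7, 7, f=3, s=1, p=0))  # (5, 5)
print(conv_output_shape(7, 7, f=3, s=2, p=0))  # (3, 3)

# A non-square input, to show that height and width are handled separately.
print(conv_output_shape(5, 7, f=3, s=1, p=0))  # (3, 5)
```

The first two calls reproduce the 5x5 and 3x3 outputs mentioned in the quoted passage, with W = 7 standing for the input's size along one spatial dimension.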