Cadene / pretrained-models.pytorch

Pretrained ConvNets for pytorch: NASNet, ResNeXt, ResNet, InceptionV4, InceptionResnetV2, Xception, DPN, etc.
BSD 3-Clause "New" or "Revised" License

Inception-ResNet-v2 is not Inception-ResNet-v2 from the paper (but it does work) #228

Open jiversivers opened 3 months ago

jiversivers commented 3 months ago

As #197 and #206 have pointed out, there are some inconsistencies between the implementation of Inception-ResNet-v2 and the paper. @agsourav highlights that the stem seems to be that of Inception-ResNet-v1, while @Rupesh-rkgit in #206 suggests that the entire schema implemented is actually that of Inception-v4. Having used this model myself, I started digging in and found things to be even messier than that... it actually goes as far as the paper itself being internally inconsistent. In the interest of having all the issues in one place, here's a summary of what I have found:

Stem

The stem does closely match Inception-ResNet-v1 at first, but in the paper the final layer of that stem is a 3x3 conv (256 filters, stride 2). In the code, that layer is replaced by a 3x3 max-pool layer that feeds into a branched structure that does not match anything in the paper. This branched structure is a pure inception module with a final output depth of 320 (see the sketch below).
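For concreteness, here is a minimal PyTorch sketch of the kind of branched module the code ends its stem with (what the repo calls mixed_5b). The channel splits (96 + 64 + 96 + 64 = 320), kernel sizes, and the 192-channel input are my reading of the code, not anything the paper describes, and the repo's conv+BN+ReLU units are simplified to bare convs here:

```python
import torch
import torch.nn as nn

class StemInceptionSketch(nn.Module):
    """Hedged sketch of the pure-inception module that ends the stem in the code."""
    def __init__(self, in_ch=192):
        super().__init__()
        self.branch0 = nn.Conv2d(in_ch, 96, kernel_size=1)
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, 48, kernel_size=1),
            nn.Conv2d(48, 64, kernel_size=5, padding=2),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1),
            nn.Conv2d(64, 96, kernel_size=3, padding=1),
            nn.Conv2d(96, 96, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 64, kernel_size=1),
        )

    def forward(self, x):
        # Pure inception: branches are concatenated, no residual connection.
        return torch.cat([self.branch0(x), self.branch1(x),
                          self.branch2(x), self.branch3(x)], dim=1)

x = torch.randn(1, 192, 35, 35)
print(StemInceptionSketch()(x).shape)  # torch.Size([1, 320, 35, 35])
```

Whatever the exact filter choices, the point is that this is a concatenating module with output depth 320, whereas the paper's stem ends in a 256-filter conv.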

Inception-ResNet-A

This block, by the two proposed schemas (Figs. 9 and 15), should be repeated either 4 or 5 times. In the code (block35), it is repeated 10 times! The block architecture is definitely passing residuals, so it is not pure inception. In fact, it does match the structure for Inception-ResNet-v2 given in the paper, with one important exception: the input/output depth is different. This inconsistency is ultimately forced by the different output depth of the stem. The final output depth of this block (by necessity, because the block is repeated and residual) is identical to the output depth of the preceding block: 320 in the code and 384 in the paper. A sketch of that depth constraint follows below.
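The sketch below shows, in hedged form, why a repeated residual block pins the depth: the branch outputs are concatenated and projected back up to the input depth before the residual add. The specific branch widths (32 + 32 + 64 = 128) and the 0.17 scaling are my reading of block35 and should be checked against the repo; BN/ReLU are again omitted:

```python
import torch
import torch.nn as nn

class InceptionResNetASketch(nn.Module):
    """Hedged sketch of a block35-style Inception-ResNet-A block."""
    def __init__(self, channels=320, scale=0.17):  # 320 in the code, 384 in the paper
        super().__init__()
        self.scale = scale
        self.branch0 = nn.Conv2d(channels, 32, kernel_size=1)
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 48, kernel_size=3, padding=1),
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
        )
        # Project the 32 + 32 + 64 = 128 concatenated channels back up to the
        # input depth so the residual sum is well defined.
        self.up = nn.Conv2d(128, channels, kernel_size=1)

    def forward(self, x):
        branches = torch.cat([self.branch0(x), self.branch1(x), self.branch2(x)], dim=1)
        return x + self.scale * self.up(branches)

x = torch.randn(1, 320, 35, 35)
print(InceptionResNetASketch()(x).shape)  # torch.Size([1, 320, 35, 35]) -- depth preserved
```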

Reduction-A

Ignoring depth differences, the structure of this block in the code (mixed_6a) does match Inception-ResNet-v2 as described by the paper (Fig. 7 and Table 1), where k = 256, l = 256, m = 384, n = 384. The code outputs a depth of 320 + m + n = 1088. The paper output depth would be 384 + m + n = 1152. Remember those numbers, as they define the input depth of our next block.
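The bookkeeping can be checked in two lines (the helper below is purely illustrative):

```python
# Reduction-A: the max-pool branch passes the input depth through unchanged,
# while the conv branches add n and m channels on top of it.
def reduction_a_out(in_depth, m=384, n=384):
    return in_depth + m + n

print(reduction_a_out(320))  # 1088 -- code, where the stem outputs 320
print(reduction_a_out(384))  # 1152 -- paper, where the stem outputs 384
```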

Inception-ResNet-B

The structure of this block in the code (block17) generally matches the structure in the paper (Fig. 17), but things start to get really messy with filter counts here. This block is repeated -- the paper says it should be 10 times for Inception-ResNet-v2, but the code does it 20 times -- so the output of the block must be the same depth as its input, which in turn must be the same as the output of the preceding block. We remember that the output from the code worked out to be 1088, while the output from the paper should be 1152. Well, the code is consistent: this block takes and gives a depth of 1088. The paper, however, is now inconsistent with itself! This block in the paper has an output depth of 1154, which cannot work given that the output depth of Reduction-A is 1152.
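Laid out numerically (my arithmetic, collecting numbers the paper never puts side by side):

```python
code_depth = 320 + 384 + 384    # 1088: Reduction-A output in the code, carried by block17
paper_depth = 384 + 384 + 384   # 1152: Reduction-A output per the paper
fig17_depth = 1154              # what Fig. 17 claims Inception-ResNet-B outputs
print(code_depth, paper_depth, fig17_depth)  # 1088 1152 1154 -- the paper disagrees with itself
```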

Reduction-B

Again, ignoring depth differences, the code structure (mixed_7a) matches the paper structure (Fig. 18) nicely. We can now track depths again. The depths of the convolutional branches in this block are 384, 288, and 320, for a total of 992. The remaining branch is a max-pool layer, so its depth matches that of the input. In the paper, this input is either 1152 or 1154; in the code it is 1088. By necessity, the output depth here must equal the next block's input depth and, because the next block is repeated, its output depth must equal its input. That depth may be either (according to the paper) 1152 + 992 = 2144 or 1154 + 992 = 2146, or (from the code) 1088 + 992 = 2080.
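Same exercise as before, assuming (as both code and paper do) that the max-pool branch passes the input depth through unchanged:

```python
# Reduction-B: three conv branches (384 + 288 + 320 = 992 channels) plus the
# pass-through max-pool branch.
def reduction_b_out(in_depth):
    return in_depth + 384 + 288 + 320

print(reduction_b_out(1088))  # 2080 -- code
print(reduction_b_out(1152))  # 2144 -- paper, trusting its Reduction-A output
print(reduction_b_out(1154))  # 2146 -- paper, trusting Fig. 17 instead
```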

Inception-ResNet-C

In the code (block8), the structure here matches the paper in the same way as its predecessors: the repeat count is off by a factor of 2 and the final output depth is inconsistent. Aside from the stem, the code is at least consistently inconsistent. The paper, though, gets worse from here. Remember, the output of this layer must be either 2144 or 2146 to be consistent with any part of the paper. It is not. Figure 19 clearly shows a final 1x1 conv with 2048 channels. This output matches neither of the paper's own options nor the code.
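One last numeric check makes the mismatch explicit:

```python
paper_options = {2144, 2146}  # Reduction-B outputs implied by the paper's own numbers
fig19_out = 2048              # the final 1x1 conv shown in Fig. 19
code_out = 2080               # the depth block8 actually carries in this repo
print(fig19_out in paper_options)  # False: Fig. 19 contradicts the rest of the paper
print(fig19_out == code_out)       # False: and it doesn't match the code either
```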

What's left to do but throw my hands up and say, "I don't know"? I'm not sure anyone does. This is a mess. On top of it all, the caption for Figure 18 mentions Inception-ResNet-v1 when it must be referring to Inception-ResNet-v2. Initially, my assumption was that the paper was the ground truth, but now I am not so sure. Is it possible that whoever wrote the PyTorch code followed the original TensorFlow implementation directly without using the publication? After all, it does work and is self-consistent. Perhaps it's the paper that's actually not consistent with the original code. By the authors' own words, they "...tried several versions of the residual version of Inception." Could they have gotten modules mixed up in the final publication? It certainly wouldn't be crazy, given the number of fine details they had to keep track of across "several" models. To quote a Season 46 Survivor contestant, "Last time I checked, several is seven," and seven models (more or less, because I know better than him) is a lot when each model is this deep.

UPDATE: The code here matches the code in the original TF research repo to a "T".

In a blog post shortly after the paper was released, Google does properly describe the model found in the code. Unfortunately, they claim in that blog post that "[t]he full details of the model are in our arXiv preprint," which is clearly inaccurate. Issue #819 of the TensorFlow repo is the only other acknowledgement of this I have managed to find; it was closed with only a reference to said blog post, while the code documentation and the blog still reference the paper directly, continuing the confusion.

Fzz123 commented 3 months ago

Your email has been received!