Question about the 2D CNN+Downsampler

Thanks for your excellent contribution! I'm trying to understand the structure of CuboidTransformerModel, but I find the implementation is not consistent with the description in the paper.

According to Table 9 in the paper, the 2D CNN+Downsampler for SEVIR has the following design:

Conv3×3
Conv3×3
GroupNorm16
LeakyReLU
PatchMerge
LayerNorm
Linear

But I notice that in class InitialStackPatchMergeEncoder:

conv_block = []
for j in range(self.num_conv_per_merge_list[i]):
    if j == 0:
        conv_in_dim = in_dim
    else:
        conv_in_dim = out_dim
    conv_block.append(nn.Conv2d(kernel_size=(3, 3), padding=(1, 1),
                            in_channels=conv_in_dim, out_channels=out_dim))
    conv_block.append(nn.GroupNorm(self.num_group_list[i], out_dim))
    conv_block.append(get_activation(activation))

Each Conv3×3 is followed by a GroupNorm and a LeakyReLU, which means the structure is actually:

Conv3×3
GroupNorm16
LeakyReLU
Conv3×3
GroupNorm16
LeakyReLU
PatchMerge
LayerNorm
Linear

I wonder if there is a mistake in the implementation of the module. It seems the last two lines shouldn't stay in the for loop, or did I get it wrong?

amazon-science / earth-forecasting-transformer

Question about the 2D CNN+Downsampler #69