ycshao21 opened this issue 3 months ago
Thank you very much for reporting this issue. You correctly pointed out that the actual model architecture is:
Conv3×3
GroupNorm16
LeakyReLU
Conv3×3
GroupNorm16
LeakyReLU
PatchMerge
LayerNorm
Linear
Our code and pretrained weights match this architecture. There is a mistake in Table 9 of our Appendix, which incorrectly described the architecture.
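To make the confirmed ordering concrete, here is a minimal dependency-free sketch of the layer sequence. The function name and the loop structure are illustrative assumptions for this thread, not the repository's actual code:

```python
def build_encoder_spec(num_conv_blocks: int = 2) -> list[str]:
    """Sketch of the confirmed 2D CNN + Downsampler ordering.

    Each Conv3x3 is followed by GroupNorm16 and LeakyReLU inside the
    loop; the patch-merging layers appear once, after the loop.
    """
    spec = []
    for _ in range(num_conv_blocks):
        spec += ["Conv3x3", "GroupNorm16", "LeakyReLU"]
    # PatchMerge / LayerNorm / Linear are appended once, outside the loop.
    spec += ["PatchMerge", "LayerNorm", "Linear"]
    return spec

print(build_encoder_spec())
```

With the default of two conv blocks this reproduces exactly the nine-layer sequence listed above.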
Thanks for your excellent contribution! I'm trying to understand the structure of CuboidTransformerModel, but I find that the implementation is inconsistent with the description in the paper.
According to Table 9 in the paper, the 2D CNN+Downsampler for SEVIR has the following design:

[Table 9 screenshot]

But I notice that in class InitialStackPatchMergeEncoder:

[code snippet]

Each Conv3×3 is followed by a GroupNorm and a LeakyReLU, which means the structure is actually:

Conv3×3
GroupNorm16
LeakyReLU
Conv3×3
GroupNorm16
LeakyReLU
PatchMerge
LayerNorm
Linear

I wonder if there is a mistake in the implementation of this module. It seems the last two lines shouldn't stay in the for loop, or did I get it wrong?
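To illustrate the two readings being compared in this question, here is a hypothetical sketch contrasting the loop placements. The function names are made up for illustration and do not appear in the repository:

```python
def spec_with_merge_inside_loop(num_blocks: int = 2) -> list[str]:
    # Hypothetical variant: if the merge layers stayed inside the
    # for loop, they would be duplicated once per conv block.
    spec = []
    for _ in range(num_blocks):
        spec += ["Conv3x3", "GroupNorm16", "LeakyReLU",
                 "PatchMerge", "LayerNorm", "Linear"]
    return spec

def spec_with_merge_after_loop(num_blocks: int = 2) -> list[str]:
    # The reading where the merge layers come once, after the loop.
    spec = []
    for _ in range(num_blocks):
        spec += ["Conv3x3", "GroupNorm16", "LeakyReLU"]
    return spec + ["PatchMerge", "LayerNorm", "Linear"]
```

The first variant emits PatchMerge/LayerNorm/Linear twice; the second emits them once at the end, matching the architecture the maintainer confirms above.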