roger-cv closed this issue 2 years ago
The fusion could also be done at the full 64×64 resolution, but that would be too computationally expensive since a transformer is used (attention is quadratic in the number of tokens), so I downsample the intermediate feature maps to 8×8 at each resolution.
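For reference, the downsampling can be sketched as a block-average pooling, and the token-count arithmetic shows why it matters. This is a minimal numpy sketch of the idea, not the repository's actual code (which presumably uses something like torch's average pooling; that is an assumption):

```python
import numpy as np

def block_avg_pool(x, out_size):
    """Average-pool a (C, H, W) feature map down to (C, out_size, out_size)
    by averaging non-overlapping blocks (assumes H and W are divisible
    by out_size)."""
    C, H, W = x.shape
    fh, fw = H // out_size, W // out_size
    return x.reshape(C, out_size, fh, out_size, fw).mean(axis=(2, 4))

feat = np.random.rand(64, 64, 64)    # hypothetical (C=64, 64, 64) feature map
pooled = block_avg_pool(feat, 8)     # -> (64, 8, 8)
print(pooled.shape)                  # (64, 8, 8)

# Self-attention builds an N x N score matrix over N tokens:
print((64 * 64) ** 2)   # 16777216 pairwise scores at 64x64
print((8 * 8) ** 2)     # 4096 pairwise scores at 8x8
```

So flattening the 8×8 map gives 64 tokens instead of 4096, shrinking the attention matrix by a factor of 4096.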
Thanks for your quick reply. So, if I understand correctly, the input feature map to the transformer at each layer is downsampled to 8×8?
That's correct. There are now several transformer variants that address the quadratic complexity issue (e.g. Linformer), so it may be possible to use the transformer without downsampling.
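To illustrate the kind of variant meant here, below is a minimal numpy sketch of Linformer-style attention: the keys and values are projected along the sequence axis by a matrix `E` (which would be learned in practice; here it is random for illustration), so the score matrix is N×k instead of N×N. This is only a sketch of the mechanism, not the actual Linformer implementation:

```python
import numpy as np

def linformer_attention(Q, K, V, E):
    """Single-head Linformer-style attention.
    Q, K, V: (N, d); E: (k, N) sequence-axis projection with k << N.
    The score matrix is (N, k), so cost is linear rather than
    quadratic in the sequence length N."""
    K_proj = E @ K                                  # (k, d)
    V_proj = E @ V                                  # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])    # (N, k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over k
    return weights @ V_proj                         # (N, d)

rng = np.random.default_rng(0)
N, d, k = 4096, 32, 64          # 64x64 tokens, low-rank dimension k
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
E = rng.standard_normal((k, N)) / np.sqrt(N)
out = linformer_attention(Q, K, V, E)
print(out.shape)                # (4096, 32)
```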
OK. Another interesting question: could this transformer-based fusion be replaced with other transformers, such as Swin or PVT? I notice that this transformer is based on GPT, which was developed for NLP.
I agree, the architecture design can be improved quite a bit.
OK, nice work. Thanks for your reply.
But it may require more resources to train...
Hi, thanks for this nice work. I am trying to modify this excellent model to suit my own task. Why is the "average" operation required here? The shape of the tensor changes from 64×64 to 8×8 after the "average" operation, but according to the description "fusion at (B, 64, 64, 64)", the shape should stay 64×64.