Hi there,
First of all, thank you for your work and for providing all the code! I was looking at the following lines in the SegFormer backbone model:
https://github.com/NVlabs/SegFormer/blob/65fa8cfa9b52b6ee7e8897a98705abf8570f9e32/mmseg/models/backbones/mix_transformer.py#L203-L220
I noticed that the argument `patch_size` is not actually being used for the `OverlapPatchEmbed` modules. Instead, you hard-coded patch sizes of [7, 3, 3, 3] for the 4 blocks. While this is of course still smaller than the 16x16 patches in ViT, and thus still lends itself better to detection and segmentation tasks, the model deviates from the paper, where you describe an initial patch size of 4 being used. This also means that classes inheriting from this class do not use the argument at all!
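To make the concern concrete, here is a minimal sketch of the pattern I mean. It is not the repository's code: `PatchEmbedSketch`, `BackboneSketch`, `SubclassSketch`, and `kernel_sizes` are hypothetical stand-ins, and only the hard-coded values [7, 3, 3, 3] are taken from the linked lines.

```python
class PatchEmbedSketch:
    """Hypothetical stand-in for OverlapPatchEmbed; it only records its patch size."""

    def __init__(self, patch_size):
        self.patch_size = patch_size


class BackboneSketch:
    """Hypothetical stand-in for the backbone class (an assumption, not the real code)."""

    def __init__(self, patch_size):
        # patch_size is accepted here but never forwarded to the embeddings;
        # the four stages always get the hard-coded values below.
        self.patch_embeds = [PatchEmbedSketch(patch_size=k) for k in (7, 3, 3, 3)]


class SubclassSketch(BackboneSketch):
    """Hypothetical subclass that passes patch_size=4, as the paper would suggest."""

    def __init__(self):
        super().__init__(patch_size=4)


def kernel_sizes(model):
    """Return the patch size actually used by each stage."""
    return [embed.patch_size for embed in model.patch_embeds]


print(kernel_sizes(BackboneSketch(patch_size=16)))  # [7, 3, 3, 3]
print(kernel_sizes(SubclassSketch()))               # [7, 3, 3, 3]
```

Under these assumptions, whatever value is passed for `patch_size`, the stages end up with the same sizes, which is the behaviour I would like to have clarified.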
Maybe I am misunderstanding something, so I would be happy if you could shed some light on this potential mistake! Thank you.