Open benzyz1129 opened 3 years ago
I have the same question. According to the paper, when img_size = 224, patch_size = 224 // 16 // 16 = 0.
In train.py, the grid size is reset, so when img_size = 224 the grid size becomes 224 // 16 = 14, and patch_size = 224 // 16 // 14 = 1.
```python
if args.vit_name.find('R50') != -1:
    config_vit.patches.grid = (int(args.img_size / args.vit_patches_size),
                               int(args.img_size / args.vit_patches_size))
```
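As a quick sanity check, here is the arithmetic spelled out (the `img_size // 16 // grid` formula is how I read `vit_seg_modeling.py`, so treat it as an assumption):

```python
# Patch-size arithmetic for img_size = 224 (formula assumed from vit_seg_modeling.py).
img_size, vit_patches_size = 224, 16

patch_size_default = img_size // 16 // 16    # 0, using a grid of 16 as in the calculation above
grid = img_size // vit_patches_size          # 14, after the reset in train.py
patch_size_reset = img_size // 16 // grid    # 1

print(patch_size_default, patch_size_reset)  # 0 1
```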
Thanks for your work. I have some questions about the patch size used by the patch embedding when a CNN-Transformer hybrid is the encoder.
In Section 3.2 of the paper, it is mentioned that the patch embedding is applied to 1x1 patches extracted from the CNN feature map, rather than from the raw image, when the CNN-Transformer hybrid is used as the encoder.
From my understanding, this means that regardless of the height and width of the feature map produced by the CNN, the patch embedding should be an nn.Conv2d with kernel_size=1 and stride=1.
Here is the relevant code, sketched as I understand it (the exact names may differ from the repository):
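```python
import torch.nn as nn

# Simplified sketch of the hybrid patch embedding as I read it
# (variable names, the 1024-channel CNN output, and hidden_size=768 are my assumptions).
def hybrid_patch_embedding(img_size, grid_size, in_channels=1024, hidden_size=768):
    # kernel_size/stride are derived from img_size and the grid,
    # so they are not fixed to 1 as Section 3.2 would suggest.
    patch_size = img_size // 16 // grid_size
    return nn.Conv2d(in_channels, hidden_size,
                     kernel_size=patch_size, stride=patch_size)
```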
When img_size = 512 and the configuration from get_r50_b16_config is applied, the output of the patch embedding is a tensor of shape (B, 1024, 16, 16). Its height and width are 1/32, not 1/16, of the original image size, so you would need 5 upsampling operations in total instead of 4, which differs from your implementation.
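To make the shapes concrete, here is a toy check reusing the `hybrid_patch_embedding` sketch above (the grid of 16 and the 1/16 ResNet feature-map resolution are my assumptions):

```python
import torch

# img_size = 512 with a grid of 16 -> patch_size = 512 // 16 // 16 = 2
embed = hybrid_patch_embedding(img_size=512, grid_size=16)
feat = torch.randn(1, 1024, 32, 32)   # assumed ResNet feature map at 1/16 of 512
print(embed(feat).shape)              # torch.Size([1, 768, 16, 16]) -> 1/32 of the input
```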
Shouldn't the kernel_size and stride be 1 when the CNN-Transformer hybrid is used as the encoder?
I would be very grateful if you could let me know whether this is a misunderstanding on my part.