facebookresearch / Mask2Former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
MIT License

How should I fix the input size during testing? #238

Open klkl2164 opened 4 months ago

klkl2164 commented 4 months ago

I have replaced the backbone of Mask2Former with VMamba, which requires the input size of my model to be fixed, for example 640x640. This is not an issue during training because the train dataloader outputs cropped images, and I only need to set the crop parameters. However, I ran into a problem during testing: I am not sure how the test dataloader works exactly (I am not very familiar with the detectron2 framework and couldn't find the relevant code). During testing, the width and height of the images are not equal, with only one of them being 640. My question is, which part of the code should I modify so that the images fed to the model are 640x640 during testing? I don't need any other data augmentation. I would greatly appreciate it if someone could provide an answer.
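For reference, in detectron2 the default test-time mapper resizes with ResizeShortestEdge, driven by cfg.INPUT.MIN_SIZE_TEST / cfg.INPUT.MAX_SIZE_TEST, which is why only one side ends up at 640. Below is a minimal sketch of one way to force a fixed size instead, assuming the standard DatasetMapper and build_detection_test_loader API; the helper name build_fixed_size_test_loader is made up, and it would still need to be wired into Mask2Former's Trainer.build_test_loader:

```python
# Sketch only: build a test loader whose mapper stretches every image to 640x640,
# replacing the default ResizeShortestEdge test augmentation (assumed detectron2 API).
import detectron2.data.transforms as T
from detectron2.data import DatasetMapper, build_detection_test_loader


def build_fixed_size_test_loader(cfg, dataset_name):
    mapper = DatasetMapper(
        cfg,
        is_train=False,
        augmentations=[T.Resize((640, 640))],  # (height, width)
    )
    return build_detection_test_loader(cfg, dataset_name, mapper=mapper)
```

The dataset dicts keep the original height/width, so the usual post-processing should map predictions back to the original resolution; note that T.Resize changes the aspect ratio, unlike ResizeShortestEdge.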

zhengyuan-xie commented 3 months ago

Same question. I resize the images in the forward function during inference, but it is not elegant :(
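For illustration, that workaround usually amounts to an explicit interpolation in front of the backbone. A minimal sketch, assuming a 640x640 target (the function name and sizes are assumptions, not code from this repo):

```python
import torch
import torch.nn.functional as F


def resize_to_fixed(x: torch.Tensor, size=(640, 640)) -> torch.Tensor:
    # Stretch a (B, C, H, W) batch to a fixed spatial size before the backbone;
    # the predictions then have to be interpolated back to the original H, W.
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)
```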

klkl2164 commented 3 months ago

> Same question. I resize the images in the forward function during inference, but it is not elegant :(

I use HUST's Vim as the backbone (https://github.com/hustvl/Vim/blob/main/vim/models_mamba.py), in which PatchEmbed specifies the input size. I followed the Swin Transformer and added a padding operation, so non-fixed input sizes can be used. Fortunately, neither Vim nor Mask2Former's pixel decoder has strict requirements on the input size. You can try modifying PatchEmbed like this:

```python
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import to_2tuple


class PatchEmbedfromswintransformer(nn.Module):

    def __init__(self, img_size=224, patch_size=16, stride=16, in_chans=3,
                 embed_dim=768, norm_layer=None, flatten=True):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = ((img_size[0] - patch_size[0]) // stride + 1,
                          (img_size[1] - patch_size[1]) // stride + 1)
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        """Forward function."""
        # Pad the right/bottom edges so H and W become multiples of the patch size
        # (the same trick used in the Swin Transformer backbone).
        _, _, H, W = x.size()
        if W % self.patch_size[1] != 0:
            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))
        if H % self.patch_size[0] != 0:
            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))

        x = self.proj(x)  # B C Wh Ww

        if self.flatten:
            x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
        x = self.norm(x)

        return x
```
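A quick sanity check of the padded PatchEmbed on a non-square input (the sizes below are only illustrative):

```python
import torch

# With patch_size=16, a 640x604 input is padded on the right to 640x608,
# giving (640 // 16) * (608 // 16) = 40 * 38 = 1520 tokens.
embed = PatchEmbedfromswintransformer(img_size=640, patch_size=16, stride=16, embed_dim=192)
tokens = embed(torch.randn(1, 3, 640, 604))
print(tokens.shape)  # torch.Size([1, 1520, 192])
```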

zhengyuan-xie commented 3 months ago

Thanks!