THU-MIG / RepViT

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything
https://arxiv.org/abs/2307.09283
Apache License 2.0

Help with Understanding the Depth of RepViT-M2.3 Model and Code Verification #69

Open MiguelMC-UNEX opened 3 months ago

MiguelMC-UNEX commented 3 months ago

Hi everyone,

I am currently working with the RepViT-M2.3 model and trying to understand the correct configuration for its depth. Specifically, I want to verify whether my implementation of the Multi_Level_Extract class aligns with the model's specifications. Here is the code I have so far:

import torch
from torch import nn

class Multi_Level_Extract(nn.Module):
    """Convolutional stem: four stride-2 convolutions, so the overall
    downsampling factor is 16 (e.g. a 224x224 input -> 14x14 feature map)."""

    def __init__(self, out_channels):
        super().__init__()
        self.seq = nn.Sequential(
            # 7x7 stride-2 conv: 3 -> out_channels[0], spatial size halved
            nn.Conv2d(3, out_channels[0], 7, 2, 3, bias=False),
            nn.ReLU(inplace=True),
            # each following 3x3 stride-2 conv halves the spatial size again
            nn.Conv2d(out_channels[0], out_channels[1], 3, 2, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[1], out_channels[2], 3, 2, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels[2], out_channels[3], 3, 2, 1, bias=False),
        )

    def forward(self, x):
        return self.seq(x)
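As a quick sanity check on the class above, you can push a dummy image through the same four stride-2 convolutions and confirm the output shape. This is just a sketch using the `out_channels` values from my m2_3 config; it rebuilds the conv stack inline so it runs standalone:

```python
import torch
from torch import nn

# Same layer stack as Multi_Level_Extract, with the m2_3 channel config.
out_channels = [64, 128, 256, 640]
stem = nn.Sequential(
    nn.Conv2d(3, out_channels[0], 7, 2, 3, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_channels[0], out_channels[1], 3, 2, 1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_channels[1], out_channels[2], 3, 2, 1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_channels[2], out_channels[3], 3, 2, 1, bias=False),
)

# Four stride-2 convs halve the spatial size four times: 224 -> 112 -> 56 -> 28 -> 14.
x = torch.randn(1, 3, 224, 224)
y = stem(x)
print(tuple(y.shape))  # (1, 640, 14, 14)
```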

According to my understanding, the RepViT-M2.3 model should have a depth of 34 layers. Here is the relevant configuration from my SelfAttention class:

class SelfAttention(nn.Module):
    def __init__(self, model_type="m2_3", pretrained=True):
        super(SelfAttention, self).__init__()
        model_config = {
            "m2_3": {
                "d_model": 640,
                "depth": 34,
                "heads": 16,
                "mlp_dim": 2560,
                "model_path": "./model/repvit_m2_3_distill_450e.pth",
                "out_channels": [64, 128, 256, 640]
                },
            # Other configurations...
        }
        # Rest of the class implementation...
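Rather than relying on conflicting sources for the depth, one way to check it empirically is to load the pretrained checkpoint from `model_path` and count the distinct top-level block indices in its state_dict keys. The `prefix="features"` default below is an assumption about the checkpoint's key layout, not something confirmed by the repo; print a few keys first and adjust the prefix to match:

```python
import re

def count_top_level_blocks(state_dict, prefix="features"):
    """Count distinct block indices under `prefix.<idx>.` in a state_dict.

    NOTE: `prefix="features"` is an assumed key layout -- print a few
    keys from the real checkpoint first and adjust if they differ.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)\.")
    indices = {int(m.group(1)) for key in state_dict if (m := pattern.match(key))}
    return len(indices)

# Usage against the checkpoint from the config above (left commented so the
# sketch runs without the file being present):
# import torch
# sd = torch.load("./model/repvit_m2_3_distill_450e.pth", map_location="cpu")
# print(count_top_level_blocks(sd))
```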

My questions are:

1. Is a depth of 34 layers correct for the RepViT-M2.3 model? I have seen different sources mention varying depths, and I want to make sure my configuration is accurate.
2. Does my implementation of the Multi_Level_Extract class align with the RepViT-M2.3 model specifications?
3. Is there anything I need to change to better fit the model's architecture?