chinhsuanwu / mobilevit-pytorch

A PyTorch implementation of "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer"
https://arxiv.org/abs/2110.02178
MIT License

Have you compared the model parameters of this implementation against the original paper? And how about the performance? #1

Closed fakerhbj closed 3 years ago

chinhsuanwu commented 3 years ago

Hi @fakerhbj

Q: Are the model settings consistent with the paper? And how about the performance?

A: Yes, the model is built using the settings in the paper, though some implementation details may not be exactly the same. FYI, the parameter counts are also close to the reported ones.

XXS: 1331472
XS: 2382944
S: 5636720

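For reference, here is a minimal sketch to reproduce these counts, assuming the mobilevit_xxs / mobilevit_xs / mobilevit_s constructors exposed by this repo's mobilevit.py:

from mobilevit import mobilevit_xxs, mobilevit_xs, mobilevit_s  # assumed entry points

def count_params(model):
    # Total number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for name, ctor in [("XXS", mobilevit_xxs), ("XS", mobilevit_xs), ("S", mobilevit_s)]:
    print(name, count_params(ctor()))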

As for the second question, this repo only implements the architecture; I have not trained it on ImageNet, so I cannot answer that.

loveq007 commented 2 years ago

There are some differences between this transformer implementation and the official one: https://github.com/apple/ml-cvnets/blob/d38a116fe134a8cd5db18670764fdaafd39a5d4f/cvnets/modules/transformer.py#L14. The official implementation:

pre_norm_mha = nn.Sequential(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=embed_dim),
    MultiHeadAttention(embed_dim, num_heads, attn_dropout=attn_dropout, bias=True),
    Dropout(p=dropout)
)

pre_norm_ffn = nn.Sequential(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=embed_dim),
    LinearLayer(in_features=embed_dim, out_features=ffn_latent_dim, bias=True),
    self.build_act_layer(opts=opts),
    Dropout(p=ffn_dropout),
    LinearLayer(in_features=ffn_latent_dim, out_features=embed_dim, bias=True),
    Dropout(p=dropout)
)
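For comparison, here is a self-contained plain-PyTorch sketch of that pre-norm layout: nn.LayerNorm stands in for get_normalization_layer, nn.MultiheadAttention for the custom MultiHeadAttention, and the SiLU activation is an assumption (the official code builds its activation layer from opts):

import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    # Illustrative plain-PyTorch equivalent of the official pre-norm layout:
    # norm -> attention -> dropout and norm -> FFN, each wrapped in a residual.
    def __init__(self, embed_dim, num_heads, ffn_latent_dim,
                 dropout=0.0, attn_dropout=0.0, ffn_dropout=0.0):
        super().__init__()
        self.attn_norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=attn_dropout, batch_first=True)
        self.attn_drop = nn.Dropout(dropout)
        self.ffn = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, ffn_latent_dim),
            nn.SiLU(),  # assumption; the official code picks the act layer from opts
            nn.Dropout(ffn_dropout),
            nn.Linear(ffn_latent_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pre-norm attention with a residual connection.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.attn_drop(h)
        # Pre-norm feed-forward with a residual connection.
        return x + self.ffn(x)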

After the transformer encoder blocks, it appends a normalization layer:

global_rep = [
    TransformerEncoder(opts=opts, embed_dim=transformer_dim, ffn_latent_dim=ffn_dims[block_idx],
                       num_heads=num_heads, attn_dropout=attn_dropout, dropout=dropout,
                       ffn_dropout=ffn_dropout, transformer_norm_layer=transformer_norm_layer)
    for block_idx in range(n_transformer_blocks)
]
global_rep.append(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=transformer_dim)
)
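In plain-PyTorch terms, this amounts to one extra normalization layer after the stack of encoder blocks. A sketch reusing PreNormTransformerBlock from above, with illustrative dimensions (the official code derives them from opts/config):

# Illustrative values only.
transformer_dim, num_heads, n_transformer_blocks = 96, 4, 2
ffn_dims = [192, 192]

global_rep = nn.Sequential(
    *[PreNormTransformerBlock(transformer_dim, num_heads, ffn_dims[i])
      for i in range(n_transformer_blocks)],
    nn.LayerNorm(transformer_dim),  # the extra normalization appended after the blocks
)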