Closed: fakerhbj closed this issue 3 years ago
There are some differences between this transformer implementation and the official one: https://github.com/apple/ml-cvnets/blob/d38a116fe134a8cd5db18670764fdaafd39a5d4f/cvnets/modules/transformer.py#L14

The official implementation:
```python
pre_norm_mha = nn.Sequential(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=embed_dim),
    MultiHeadAttention(embed_dim, num_heads, attn_dropout=attn_dropout, bias=True),
    Dropout(p=dropout)
)

pre_norm_ffn = nn.Sequential(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=embed_dim),
    LinearLayer(in_features=embed_dim, out_features=ffn_latent_dim, bias=True),
    self.build_act_layer(opts=opts),
    Dropout(p=ffn_dropout),
    LinearLayer(in_features=ffn_latent_dim, out_features=embed_dim, bias=True),
    Dropout(p=dropout)
)
```
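Note that the `nn.Sequential` containers above only hold the sub-modules; the residual connections are added in the block's `forward`. A minimal runnable sketch of that pre-norm structure, using plain `nn.LayerNorm`, `nn.MultiheadAttention`, and `nn.SiLU` as stand-ins for the repo's `get_normalization_layer`, `MultiHeadAttention`, and `build_act_layer` wrappers (the class name and default sizes here are hypothetical):

```python
import torch
import torch.nn as nn

class PreNormEncoderSketch(nn.Module):
    """Pre-norm transformer block: norm is applied *before* attention/FFN,
    and the residual is taken around the whole (norm -> sublayer) path."""
    def __init__(self, embed_dim=64, num_heads=4, ffn_latent_dim=128, dropout=0.1):
        super().__init__()
        # stand-ins for get_normalization_layer / MultiHeadAttention / Dropout
        self.mha_norm = nn.LayerNorm(embed_dim)
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.mha_drop = nn.Dropout(dropout)
        # stand-in for the pre_norm_ffn Sequential above
        self.pre_norm_ffn = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, ffn_latent_dim, bias=True),
            nn.SiLU(),  # stand-in for self.build_act_layer(opts=opts)
            nn.Dropout(dropout),
            nn.Linear(ffn_latent_dim, embed_dim, bias=True),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # residual around (norm -> self-attention -> dropout)
        y = self.mha_norm(x)
        y, _ = self.mha(y, y, y, need_weights=False)
        x = x + self.mha_drop(y)
        # residual around (norm -> FFN)
        x = x + self.pre_norm_ffn(x)
        return x

x = torch.randn(2, 8, 64)          # (batch, tokens, embed_dim)
out = PreNormEncoderSketch()(x)    # shape is preserved: (2, 8, 64)
```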
After the transformer encoder blocks, it appends a normalization layer:
```python
global_rep = [
    TransformerEncoder(opts=opts, embed_dim=transformer_dim, ffn_latent_dim=ffn_dims[block_idx],
                       num_heads=num_heads, attn_dropout=attn_dropout, dropout=dropout,
                       ffn_dropout=ffn_dropout, transformer_norm_layer=transformer_norm_layer)
    for block_idx in range(n_transformer_blocks)
]
global_rep.append(
    get_normalization_layer(opts=opts, norm_type=transformer_norm_layer, num_features=transformer_dim)
)
```
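The same stack-then-normalize pattern can be sketched with stock PyTorch modules; here `nn.TransformerEncoderLayer(norm_first=True)` is a hypothetical stand-in for the repo's `TransformerEncoder`, and the trailing `nn.LayerNorm` plays the role of the appended `get_normalization_layer`:

```python
import torch
import torch.nn as nn

embed_dim, n_blocks = 64, 2  # illustrative sizes, not the paper's

# n pre-norm encoder blocks followed by one final normalization layer,
# mirroring the global_rep list in the official snippet above
global_rep = nn.Sequential(
    *[nn.TransformerEncoderLayer(embed_dim, nhead=4, dim_feedforward=128,
                                 norm_first=True, batch_first=True)
      for _ in range(n_blocks)],
    nn.LayerNorm(embed_dim),  # extra norm appended after the blocks
)

x = torch.randn(2, 8, embed_dim)  # (batch, tokens, embed_dim)
y = global_rep(x)                 # shape is preserved: (2, 8, 64)
```

The final norm matters for pre-norm stacks: each block's output is an un-normalized residual sum, so without it the last block's features leave the encoder unscaled.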
Hi @fakerhbj
Q: Are the model settings consistent with the paper? And how is the performance?

A: Yes, the model is built with the settings from the paper, though some implementation details may not be exactly the same. FYI, the parameter counts are also close to the reported ones.

As for the second question, this repo is just an implementation of the architecture; I have not trained it on ImageNet, so I cannot answer that.