apple / ml-aim

This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.

Mismatches between ViT-H/14 in AIM and ViT-H/14 in MAE #7

Open TonyLianLong opened 9 months ago

AIM-600M:

def aim_600M(img_size: Union[int, Tuple[int, int]] = 224, **kwargs: Any) -> AIM:
    preprocessor, trunk, head = _aim(
        img_size=img_size,
        patch_size=14,
        embed_dim=1536,
        num_blocks=24,
        num_heads=12,
        **kwargs,
    )
    return AIM(preprocessor, trunk, head)

https://github.com/apple/ml-aim/blob/0b1dea9128f4734ae89252078e65aa102999407a/aim/torch/models.py#L176-L185

MAE ViT-H/14:

def vit_huge_patch14(**kwargs):
    model = VisionTransformer(
        patch_size=14, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model

https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/models_vit.py#L70-L74

The two models differ in embedding dimension (1536 vs. 1280), depth (24 vs. 32 blocks), and number of heads (12 vs. 16), so they are incompatible with each other. However, Tab. 6 of the paper lists the same architecture for both works in the "Arch." column. Are the two architectures actually different, as the code shows? If so, the paper should probably clarify this, for example by reporting the number of parameters for each.
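
For reference, here is a rough back-of-the-envelope sketch of how the two configurations compare in size. It counts only the attention and MLP weight matrices in the transformer blocks and ignores biases, norms, patch and positional embeddings, and any head. The helper name `approx_trunk_params` is made up for this illustration, and `mlp_ratio=4` for AIM is an assumption, since the snippet above does not show it.

```python
def approx_trunk_params(embed_dim: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a plain ViT trunk (hypothetical helper).

    Per block: 4 * d^2 for attention (q, k, v, output projection)
    plus 2 * mlp_ratio * d^2 for the two MLP layers.
    Biases, norms, embeddings, and heads are ignored.
    """
    per_block = (4 + 2 * mlp_ratio) * embed_dim ** 2
    return depth * per_block


# Values taken from the two snippets quoted above.
# mlp_ratio=4 for AIM is an assumption, not confirmed by the quoted code.
configs = {
    "AIM-600M": {"embed_dim": 1536, "depth": 24, "num_heads": 12},
    "MAE ViT-H/14": {"embed_dim": 1280, "depth": 32, "num_heads": 16},
}

for name, cfg in configs.items():
    n = approx_trunk_params(cfg["embed_dim"], cfg["depth"])
    head_dim = cfg["embed_dim"] // cfg["num_heads"]
    print(f"{name}: ~{n / 1e6:.0f}M trunk weights, head_dim={head_dim}")
```

Under these assumptions the two trunks come out at a similar order of magnitude (roughly 0.68B vs. 0.63B weights) even though the layer shapes differ, which may be why they end up under the same label. The number of heads does not change the weight count, only the per-head dimension (128 vs. 80).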