Beckschen / ViTamin

[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"
Apache License 2.0

Validity of models uploaded to huggingface #7

Closed ytaek-oh closed 4 months ago

ytaek-oh commented 4 months ago

Dear authors, thank you for sharing the code and checkpoints from your amazing project! I really appreciate that they are easily accessible via huggingface.

While testing the models listed on the main page of your repo, I found that some of them do not work correctly, as detailed below. Could you possibly take a look at the models' availability?

 File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3404, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/appuser/.cache/huggingface/modules/transformers_modules/jienengchen/ViTamin-L2-224px/0e467ba2fb5ea5735acc161718f419d8968637fb/model.py", line 420, in __init__
    self.visual = _build_vision_tower(embed_dim, vision_cfg, quick_gelu, cast_dtype)
  File "/home/appuser/.cache/huggingface/modules/transformers_modules/jienengchen/ViTamin-L2-224px/0e467ba2fb5ea5735acc161718f419d8968637fb/model.py", line 121, in _build_vision_tower
    visual = TimmModel(
  File "/home/appuser/.cache/huggingface/modules/transformers_modules/jienengchen/ViTamin-L2-224px/0e467ba2fb5ea5735acc161718f419d8968637fb/timm_model.py", line 71, in __init__
    self.trunk = timm.create_model(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/timm/models/_factory.py", line 113, in create_model
    raise RuntimeError('Unknown model (%s)' % model_name)
RuntimeError: Unknown model (vitamin_large_224)
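For context, timm's `create_model` raises this error whenever the requested architecture name was never registered with its model registry; the ViTamin architectures only become available after the repo's model code registers them at import time. A minimal, self-contained mimic of that factory behavior (an illustration, not timm's actual code):

```python
# Minimal mimic of timm's create_model registry lookup, for illustration only:
# an architecture must be registered (normally via timm's @register_model
# decorator at import time) before it can be instantiated by name.
_model_entrypoints = {
    "vit_base_patch16_224": lambda **kwargs: "stub ViT model",
}

def create_model(model_name, **kwargs):
    # Same error shape as the timm traceback above.
    if model_name not in _model_entrypoints:
        raise RuntimeError('Unknown model (%s)' % model_name)
    return _model_entrypoints[model_name](**kwargs)
```

So the `Unknown model (vitamin_large_224)` failure suggests the model code bundled with the HF repo never registered a `vitamin_large_224` entrypoint with the installed timm version.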


  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 523, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: jienengchen/ViTamin-L-256px does not appear to have a file named config.json. Checkout 'https://huggingface.co/jienengchen/ViTamin-L-256px/main' for available files.


For example,

Some weights of the model checkpoint at jienengchen/ViTamin-L2-384px were not used when initializing ViTaminCLIP: ['ln_final.bias', 'ln_final.weight', 'positional_embedding', ... ] # seemingly all params 
Some weights of ViTaminCLIP were not initialized from the model checkpoint at jienengchen/ViTamin-L2-384px and are newly initialized:  ['text.ln_final.bias', 'text.ln_final.weight', ... ]  # seemingly all params

This results in nearly zero performance because the model is effectively uninitialized.
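The warning pattern above (checkpoint keys like `ln_final.weight` unused, model keys like `text.ln_final.weight` newly initialized) points to a naming mismatch: the text-tower parameters appear to be stored without the `text.` prefix that the `ViTaminCLIP` module expects, so `from_pretrained` silently matches none of them. A hypothetical remap sketch, with key names assumed from the warnings above:

```python
def remap_text_tower_keys(state_dict):
    """Re-prefix text-tower checkpoint keys with 'text.' so they line up
    with the module's parameter names. Hypothetical fix: it assumes the
    vision-tower prefix 'visual.' and the scalar 'logit_scale' already match.
    """
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("visual.") or key == "logit_scale":
            remapped[key] = value  # already aligned with the module
        else:
            # e.g. 'ln_final.weight' -> 'text.ln_final.weight'
            remapped["text." + key] = value
    return remapped
```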


  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 523, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1138, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/appuser/.conda/envs/vita/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: jienengchen/ViTamin-XL-336px does not appear to have a file named config.json. Checkout 'https://huggingface.co/jienengchen/ViTamin-XL-336px/main' for available files.


The remaining models initialized successfully and could be evaluated, yielding numbers aligned with your report, except for jienengchen/ViTamin-L-384px: on zero-shot ImageNet classification, that checkpoint gives 72.0 rather than your reported 81.8.

You can run a quick zero-shot ImageNet test with the attachment below; it downloads the model checkpoints from the corresponding HF repositories.
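At its core, such a zero-shot test just compares each image embedding against the text embeddings of class-name prompts by cosine similarity. A dependency-free sketch of the scoring step (the attachment presumably batches this with the model's actual encoders; the function and variable names here are illustrative):

```python
import math

def zero_shot_top1(image_feats, text_feats, labels):
    """Top-1 zero-shot accuracy: each image is assigned the class whose
    L2-normalized text embedding has the highest cosine similarity."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    class_embs = [normalize(t) for t in text_feats]
    correct = 0
    for feats, label in zip(image_feats, labels):
        img = normalize(feats)
        sims = [sum(a * b for a, b in zip(img, cls)) for cls in class_embs]
        if max(range(len(sims)), key=sims.__getitem__) == label:
            correct += 1
    return correct / len(labels)
```

A checkpoint whose text tower is randomly initialized makes these similarities meaningless, which is why a broken model scores at chance (about 0.1% over the 1000 ImageNet classes).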

Reproduced:

ViTamin-L-336px: 81.61, ViTamin-XL-256px: 82.24, ViTamin-XL-384px: 82.88

Not reproduced:

ViTamin-L-384px: 72.00



Best regards,
ytaek-oh commented 4 months ago

For reference, my evaluation results on the valid models:

| arch | coco_i2t | coco_t2i | flickr_i2t | flickr_t2i | winogavil | IN1k |
|---|---|---|---|---|---|---|
| ViTamin-L-336px | 64.28 | 47.13 | 89.60 | 74.50 | 46.69 | 81.61 |
| ViTamin-L-384px (not reproduced) | 42.08 | 36.57 | 67.60 | 61.94 | 44.90 | 72.00 |
| ViTamin-XL-384px | 68.02 | 50.08 | 92.00 | 78.34 | 43.36 | 82.88 |
| ViTamin-XL-256px | 67.32 | 49.47 | 90.60 | 77.62 | 43.84 | 82.24 |
Beckschen commented 4 months ago

Thanks so much! I have updated the model initialization (config.json) of ViTamin-L-224px, ViTamin-L2-224px, and ViTamin-L-256px. I have also uploaded the ViTamin-XL-336px model (it had previously failed to push to HF), and fixed jienengchen/ViTamin-L-384px (a positional-embedding issue) as well.

I used the mentioned evaluation pipeline to evaluate jienengchen/ViTamin-L-384px and can now reproduce the results, with a zero-shot ImageNet-1k score of 81.79%.

Your suggestions are super helpful. Please let me know if you have further questions.

ytaek-oh commented 4 months ago

Thank you so much for your prompt reply and resolving the issue!

I have downloaded the updated versions from huggingface; the corresponding evaluation results are below. The L2 models on huggingface still have the same issue of loading pretrained checkpoints, but when I loaded them locally (borrowing your term, via the open_clip interface), they worked fine.

=== Results of ViTamin-L-224px model ===
ImageNet zero-shot accuracy: 80.752  # official report: 80.8

=== Results of ViTamin-L-256px model ===
ImageNet zero-shot accuracy: 81.184  # official report: 81.2

=== Results of ViTamin-L-336px model ===
ImageNet zero-shot accuracy: 81.606  # official report: 81.6

=== Results of ViTamin-L-384px model ===
ImageNet zero-shot accuracy: 81.794  # official report: 81.8

=== Results of ViTamin-XL-256px model ===
ImageNet zero-shot accuracy: 82.292  # official report: 82.3

=== Results of ViTamin-XL-336px model ===
ImageNet zero-shot accuracy: 82.696  # official report: 82.7

=== Results of ViTamin-XL-384px model ===
ImageNet zero-shot accuracy: 82.882  # official report: 82.9

=== Results of ViTamin-L2-224px model ===
ImageNet zero-shot accuracy: 0.1  # huggingface interface: not using pretrained checkpoint
ImageNet zero-shot accuracy: 80.896  # openclip interface, official report: 80.9

=== Results of ViTamin-L2-336px model ===
ImageNet zero-shot accuracy: 0.1  # huggingface interface: not using pretrained checkpoint
ImageNet zero-shot accuracy: 81.426  # openclip interface, official report: 81.5

=== Results of ViTamin-L2-256px model ===
ImageNet zero-shot accuracy: 0.1  # huggingface interface: not using pretrained checkpoint
ImageNet zero-shot accuracy: 81.79  # openclip interface, official report: 81.8

=== Results of ViTamin-L2-384px model ===
ImageNet zero-shot accuracy: 0.1  # huggingface interface: not using pretrained checkpoint
ImageNet zero-shot accuracy: 82.066  # openclip interface, official report: 82.1

The full log including warnings when loading L2 models via huggingface:

click to expand

```
(vita) appuser@d009b06d8de2:~/train/clip_zeroshot_imagenet$ python eval_imagenet_clip.py --model ViTamin-L2-224px
# download
config.json: 100% 547/547 [00:00<00:00, 72.1kB/s]
configuration_vitamin.py: 100% 5.47k/5.47k [00:00<00:00, 461kB/s]
A new version of the following files was downloaded from https://huggingface.co/jienengchen/ViTamin-L2-224px:
- configuration_vitamin.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
model.py: 100% 28.0k/28.0k [00:00<00:00, 5.99MB/s]
A new version of the following files was downloaded from https://huggingface.co/jienengchen/ViTamin-L2-224px:
- model.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
pytorch_model.bin: 100% 2.75G/2.75G [17:03<00:00, 2.69MB/s]
# warnings (1) and (2) are not relevant to the failure of loading checkpoints; they also occur in the other valid models
(1) /home/appuser/.conda/envs/vita/lib/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
(2) No pretrained configuration specified for vitamin_large model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Some weights of the model checkpoint at jienengchen/ViTamin-L2-224px were not used when initializing ViTaminCLIP: ['ln_final.bias', 'ln_final.weight', 'positional_embedding', 'text_projection', 'token_embedding.weight', 'transformer.resblocks.0.attn.in_proj_bias', 'transformer.resblocks.0.attn.in_proj_weight', 'transformer.resblocks.0.attn.out_proj.bias', 'transformer.resblocks .0.attn.out_proj.weight', 'transformer.resblocks.0.ln_1.bias', 'transformer.resblocks.0.ln_1.weight', 'transformer.resblocks.0.ln_2.bias', 'transformer.resblocks.0.ln_2.weight', 'transformer.resblocks.0.mlp.c_fc.bias', 'transformer.resblocks.0.mlp.c_fc.weight', 'transformer.resblocks.0.mlp.c_proj.bias', 'transformer.resblocks.0.mlp.c_proj.weight', 'transformer.resblocks.1.attn. in_proj_bias', 'transformer.resblocks.1.attn.in_proj_weight', 'transformer.resblocks.1.attn.out_proj.bias', 'transformer.resblocks.1.attn.out_proj.weight', 'transformer.resblocks.1.ln_1.bias', 'transformer.resblocks.1.ln_1.weight', 'transformer.resblocks.1.ln_2.bias', 'transformer.resblocks.1.ln_2.weight', 'transformer.resblocks.1.mlp.c_fc.bias', 'transformer.resblocks.1.mlp.c_ fc.weight', 'transformer.resblocks.1.mlp.c_proj.bias', 'transformer.resblocks.1.mlp.c_proj.weight', 'transformer.resblocks.10.attn.in_proj_bias', 'transformer.resblocks.10.attn.in_proj_weight', 'transformer.resblocks.10.attn.out_proj.bias', 'transformer.resblocks.10.attn.out_proj.weight', 'transformer.resblocks.10.ln_1.bias', 'transformer.resblocks.10.ln_1.weight', 'transformer .resblocks.10.ln_2.bias', 'transformer.resblocks.10.ln_2.weight', 'transformer.resblocks.10.mlp.c_fc.bias', 'transformer.resblocks.10.mlp.c_fc.weight', 'transformer.resblocks.10.mlp.c_proj.bias', 'transformer.resblocks.10.mlp.c_proj.weight', 'transformer.resblocks.11.attn.in_proj_bias', 'transformer.resblocks.11.attn.in_proj_weight', 'transformer.resblocks.11.attn.out_proj.bias ', 'transformer.resblocks.11.attn.out_proj.weight', 'transformer.resblocks.11.ln_1.bias', 
'transformer.resblocks.11.ln_1.weight', 'transformer.resblocks.11.ln_2.bias', 'transformer.resblocks.11.ln_2.weight', 'transformer.resblocks.11.mlp.c_fc.bias', 'transformer.resblocks.11.mlp.c_fc.weight', 'transformer.resblocks.11.mlp.c_proj.bias', 'transformer.resblocks.11.mlp.c_proj.weigh t', 'transformer.resblocks.12.attn.in_proj_bias', 'transformer.resblocks.12.attn.in_proj_weight', 'transformer.resblocks.12.attn.out_proj.bias', 'transformer.resblocks.12.attn.out_proj.weight', 'transformer.resblocks.12.ln_1.bias', 'transformer.resblocks.12.ln_1.weight', 'transformer.resblocks.12.ln_2.bias', 'transformer.resblocks.12.ln_2.weight', 'transformer.resblocks.12.mlp. c_fc.bias', 'transformer.resblocks.12.mlp.c_fc.weight', 'transformer.resblocks.12.mlp.c_proj.bias', 'transformer.resblocks.12.mlp.c_proj.weight', 'transformer.resblocks.13.attn.in_proj_bias', 'transformer.resblocks.13.attn.in_proj_weight', 'transformer.resblocks.13.attn.out_proj.bias', 'transformer.resblocks.13.attn.out_proj.weight', 'transformer.resblocks.13.ln_1.bias', 'trans former.resblocks.13.ln_1.weight', 'transformer.resblocks.13.ln_2.bias', 'transformer.resblocks.13.ln_2.weight', 'transformer.resblocks.13.mlp.c_fc.bias', 'transformer.resblocks.13.mlp.c_fc.weight', 'transformer.resblocks.13.mlp.c_proj.bias', 'transformer.resblocks.13.mlp.c_proj.weight', 'transformer.resblocks.14.attn.in_proj_bias', 'transformer.resblocks.14.attn.in_proj_weight' , 'transformer.resblocks.14.attn.out_proj.bias', 'transformer.resblocks.14.attn.out_proj.weight', 'transformer.resblocks.14.ln_1.bias', 'transformer.resblocks.14.ln_1.weight', 'transformer.resblocks.14.ln_2.bias', 'transformer.resblocks.14.ln_2.weight', 'transformer.resblocks.14.mlp.c_fc.bias', 'transformer.resblocks.14.mlp.c_fc.weight', 'transformer.resblocks.14.mlp.c_proj.bia s', 'transformer.resblocks.14.mlp.c_proj.weight', 'transformer.resblocks.15.attn.in_proj_bias', 'transformer.resblocks.15.attn.in_proj_weight', 
'transformer.resblocks.15.attn.out_proj.bias', 'transformer.resblocks.15.attn.out_proj.weight', 'transformer.resblocks.15.ln_1.bias', 'transformer.resblocks.15.ln_1.weight', 'transformer.resblocks.15.ln_2.bias', 'transformer.resblocks.1 5.ln_2.weight', 'transformer.resblocks.15.mlp.c_fc.bias', 'transformer.resblocks.15.mlp.c_fc.weight', 'transformer.resblocks.15.mlp.c_proj.bias', 'transformer.resblocks.15.mlp.c_proj.weight', 'transformer.resblocks.16.attn.in_proj_bias', 'transformer.resblocks.16.attn.in_proj_weight', 'transformer.resblocks.16.attn.out_proj.bias', 'transformer.resblocks.16.attn.out_proj.weight' , 'transformer.resblocks.16.ln_1.bias', 'transformer.resblocks.16.ln_1.weight', 'transformer.resblocks.16.ln_2.bias', 'transformer.resblocks.16.ln_2.weight', 'transformer.resblocks.16.mlp.c_fc.bias', 'transformer.resblocks.16.mlp.c_fc.weight', 'transformer.resblocks.16.mlp.c_proj.bias', 'transformer.resblocks.16.mlp.c_proj.weight', 'transformer.resblocks.17.attn.in_proj_bias', 'transformer.resblocks.17.attn.in_proj_weight', 'transformer.resblocks.17.attn.out_proj.bias', 'transformer.resblocks.17.attn.out_proj.weight', 'transformer.resblocks.17.ln_1.bias', 'transformer.resblocks.17.ln_1.weight', 'transformer.resblocks.17.ln_2.bias', 'transformer.resblocks.17.ln_2.weight', 'transformer.resblocks.17.mlp.c_fc.bias', 'transformer.resblocks.17.mlp.c_fc.wei ght', 'transformer.resblocks.17.mlp.c_proj.bias', 'transformer.resblocks.17.mlp.c_proj.weight', 'transformer.resblocks.18.attn.in_proj_bias', 'transformer.resblocks.18.attn.in_proj_weight', 'transformer.resblocks.18.attn.out_proj.bias', 'transformer.resblocks.18.attn.out_proj.weight', 'transformer.resblocks.18.ln_1.bias', 'transformer.resblocks.18.ln_1.weight', 'transformer.res blocks.18.ln_2.bias', 'transformer.resblocks.18.ln_2.weight', 'transformer.resblocks.18.mlp.c_fc.bias', 'transformer.resblocks.18.mlp.c_fc.weight', 'transformer.resblocks.18.mlp.c_proj.bias', 'transformer.resblocks.18.mlp.c_proj.weight', 
'transformer.resblocks.19.attn.in_proj_bias', 'transformer.resblocks.19.attn.in_proj_weight', 'transformer.resblocks.19.attn.out_proj.bias', ' transformer.resblocks.19.attn.out_proj.weight', 'transformer.resblocks.19.ln_1.bias', 'transformer.resblocks.19.ln_1.weight', 'transformer.resblocks.19.ln_2.bias', 'transformer.resblocks.19.ln_2.weight', 'transformer.resblocks.19.mlp.c_fc.bias', 'transformer.resblocks.19.mlp.c_fc.weight', 'transformer.resblocks.19.mlp.c_proj.bias', 'transformer.resblocks.19.mlp.c_proj.weight', 'transformer.resblocks.2.attn.in_proj_bias', 'transformer.resblocks.2.attn.in_proj_weight', 'transformer.resblocks.2.attn.out_proj.bias', 'transformer.resblocks.2.attn.out_proj.weight', 'transformer.resblocks.2.ln_1.bias', 'transformer.resblocks.2.ln_1.weight', 'transformer.resblocks.2.ln_2.bias', 'transformer.resblocks.2.ln_2.weight', 'transformer.resblocks.2.mlp.c_fc.bias', ' transformer.resblocks.2.mlp.c_fc.weight', 'transformer.resblocks.2.mlp.c_proj.bias', 'transformer.resblocks.2.mlp.c_proj.weight', 'transformer.resblocks.20.attn.in_proj_bias', 'transformer.resblocks.20.attn.in_proj_weight', 'transformer.resblocks.20.attn.out_proj.bias', 'transformer.resblocks.20.attn.out_proj.weight', 'transformer.resblocks.20.ln_1.bias', 'transformer.resblocks .20.ln_1.weight', 'transformer.resblocks.20.ln_2.bias', 'transformer.resblocks.20.ln_2.weight', 'transformer.resblocks.20.mlp.c_fc.bias', 'transformer.resblocks.20.mlp.c_fc.weight', 'transformer.resblocks.20.mlp.c_proj.bias', 'transformer.resblocks.20.mlp.c_proj.weight', 'transformer.resblocks.21.attn.in_proj_bias', 'transformer.resblocks.21.attn.in_proj_weight', 'transformer.r esblocks.21.attn.out_proj.bias', 'transformer.resblocks.21.attn.out_proj.weight', 'transformer.resblocks.21.ln_1.bias', 'transformer.resblocks.21.ln_1.weight', 'transformer.resblocks.21.ln_2.bias', 'transformer.resblocks.21.ln_2.weight', 'transformer.resblocks.21.mlp.c_fc.bias', 'transformer.resblocks.21.mlp.c_fc.weight', 
'transformer.resblocks.21.mlp.c_proj.bias', 'transformer .resblocks.21.mlp.c_proj.weight', 'transformer.resblocks.22.attn.in_proj_bias', 'transformer.resblocks.22.attn.in_proj_weight', 'transformer.resblocks.22.attn.out_proj.bias', 'transformer.resblocks.22.attn.out_proj.weight', 'transformer.resblocks.22.ln_1.bias', 'transformer.resblocks.22.ln_1.weight', 'transformer.resblocks.22.ln_2.bias', 'transformer.resblocks.22.ln_2.weight', 'transformer.resblocks.22.mlp.c_fc.bias', 'transformer.resblocks.22.mlp.c_fc.weight', 'transformer.resblocks.22.mlp.c_proj.bias', 'transformer.resblocks.22.mlp.c_proj.weight', 'transformer.resblocks.23.attn.in_proj_bias', 'transformer.resblocks.23.attn.in_proj_weight', 'transformer.resblocks.23.attn.out_proj.bias', 'transformer.resblocks.23.attn.out_proj.weight', 'transformer.r esblocks.23.ln_1.bias', 'transformer.resblocks.23.ln_1.weight', 'transformer.resblocks.23.ln_2.bias', 'transformer.resblocks.23.ln_2.weight', 'transformer.resblocks.23.mlp.c_fc.bias', 'transformer.resblocks.23.mlp.c_fc.weight', 'transformer.resblocks.23.mlp.c_proj.bias', 'transformer.resblocks.23.mlp.c_proj.weight', 'transformer.resblocks.3.attn.in_proj_bias', 'transformer.resb locks.3.attn.in_proj_weight', 'transformer.resblocks.3.attn.out_proj.bias', 'transformer.resblocks.3.attn.out_proj.weight', 'transformer.resblocks.3.ln_1.bias', 'transformer.resblocks.3.ln_1.weight', 'transformer.resblocks.3.ln_2.bias', 'transformer.resblocks.3.ln_2.weight', 'transformer.resblocks.3.mlp.c_fc.bias', 'transformer.resblocks.3.mlp.c_fc.weight', 'transformer.resbloc ks.3.mlp.c_proj.bias', 'transformer.resblocks.3.mlp.c_proj.weight', 'transformer.resblocks.4.attn.in_proj_bias', 'transformer.resblocks.4.attn.in_proj_weight', 'transformer.resblocks.4.attn.out_proj.bias', 'transformer.resblocks.4.attn.out_proj.weight', 'transformer.resblocks.4.ln_1.bias', 'transformer.resblocks.4.ln_1.weight', 'transformer.resblocks.4.ln_2.bias', 'transformer. 
resblocks.4.ln_2.weight', 'transformer.resblocks.4.mlp.c_fc.bias', 'transformer.resblocks.4.mlp.c_fc.weight', 'transformer.resblocks.4.mlp.c_proj.bias', 'transformer.resblocks.4.mlp.c_proj.weight', 'transformer.resblocks.5.attn.in_proj_bias', 'transformer.resblocks.5.attn.in_proj_weight', 'transformer.resblocks.5.attn.out_proj.bias', 'transformer.resblocks.5.attn.out_proj.weigh t', 'transformer.resblocks.5.ln_1.bias', 'transformer.resblocks.5.ln_1.weight', 'transformer.resblocks.5.ln_2.bias', 'transformer.resblocks.5.ln_2.weight', 'transformer.resblocks.5.mlp.c_fc.bias', 'transformer.resblocks.5.mlp.c_fc.weight', 'transformer.resblocks.5.mlp.c_proj.bias', 'transformer.resblocks.5.mlp.c_proj.weight', 'transformer.resblocks.6.attn.in_proj_bias', 'transf ormer.resblocks.6.attn.in_proj_weight', 'transformer.resblocks.6.attn.out_proj.bias', 'transformer.resblocks.6.attn.out_proj.weight', 'transformer.resblocks.6.ln_1.bias', 'transformer.resblocks.6.ln_1.weight', 'transformer.resblocks.6.ln_2.bias', 'transformer.resblocks.6.ln_2.weight', 'transformer.resblocks.6.mlp.c_fc.bias', 'transformer.resblocks.6.mlp.c_fc.weight', 'transform er.resblocks.6.mlp.c_proj.bias', 'transformer.resblocks.6.mlp.c_proj.weight', 'transformer.resblocks.7.attn.in_proj_bias', 'transformer.resblocks.7.attn.in_proj_weight', 'transformer.resblocks.7.attn.out_proj.bias', 'transformer.resblocks.7.attn.out_proj.weight', 'transformer.resblocks.7.ln_1.bias', 'transformer.resblocks.7.ln_1.weight', 'transformer.resblocks.7.ln_2.bias', 'tr ansformer.resblocks.7.ln_2.weight', 'transformer.resblocks.7.mlp.c_fc.bias', 'transformer.resblocks.7.mlp.c_fc.weight', 'transformer.resblocks.7.mlp.c_proj.bias', 'transformer.resblocks.7.mlp.c_proj.weight', 'transformer.resblocks.8.attn.in_proj_bias', 'transformer.resblocks.8.attn.in_proj_weight', 'transformer.resblocks.8.attn.out_proj.bias', 'transformer.resblocks.8.attn.out_ proj.weight', 'transformer.resblocks.8.ln_1.bias', 'transformer.resblocks.8.ln_1.weight', 
'transformer.resblocks.8.ln_2.bias', 'transformer.resblocks.8.ln_2.weight', 'transformer.resblocks.8.mlp.c_fc.bias', 'transformer.resblocks.8.mlp.c_fc.weight', 'transformer.resblocks.8.mlp.c_proj.bias', 'transformer.resblocks.8.mlp.c_proj.weight', 'transformer.resblocks.9.attn.in_proj_bias ', 'transformer.resblocks.9.attn.in_proj_weight', 'transformer.resblocks.9.attn.out_proj.bias', 'transformer.resblocks.9.attn.out_proj.weight', 'transformer.resblocks.9.ln_1.bias', 'transformer.resblocks.9.ln_1.weight', 'transformer.resblocks.9.ln_2.bias', 'transformer.resblocks.9.ln_2.weight', 'transformer.resblocks.9.mlp.c_fc.bias', 'transformer.resblocks.9.mlp.c_fc.weight', 'transformer.resblocks.9.mlp.c_proj.bias', 'transformer.resblocks.9.mlp.c_proj.weight'] - This IS expected if you are initializing ViTaminCLIP from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing ViTaminCLIP from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). 
Some weights of ViTaminCLIP were not initialized from the model checkpoint at jienengchen/ViTamin-L2-224px and are newly initialized: ['text.ln_final.bias', 'text.ln_final.weight', 'text.positional_embedding', 'text.text_projection', 'text.token_embedding.weight', 'text.transformer.resblocks.0.attn.in_proj_bias', 'text.transformer.resblocks.0.attn.in_proj_weight', 'text.transfo rmer.resblocks.0.attn.out_proj.bias', 'text.transformer.resblocks.0.attn.out_proj.weight', 'text.transformer.resblocks.0.ln_1.bias', 'text.transformer.resblocks.0.ln_1.weight', 'text.transformer.resblocks.0.ln_2.bias', 'text.transformer.resblocks.0.ln_2.weight', 'text.transformer.resblocks.0.mlp.c_fc.bias', 'text.transformer.resblocks.0.mlp.c_fc.weight', 'text.transformer.resbl ocks.0.mlp.c_proj.bias', 'text.transformer.resblocks.0.mlp.c_proj.weight', 'text.transformer.resblocks.1.attn.in_proj_bias', 'text.transformer.resblocks.1.attn.in_proj_weight', 'text.transformer.resblocks.1.attn.out_proj.bias', 'text.transformer.resblocks.1.attn.out_proj.weight', 'text.transformer.resblocks.1.ln_1.bias', 'text.transformer.resblocks.1.ln_1.weight', 'text.transfo rmer.resblocks.1.ln_2.bias', 'text.transformer.resblocks.1.ln_2.weight', 'text.transformer.resblocks.1.mlp.c_fc.bias', 'text.transformer.resblocks.1.mlp.c_fc.weight', 'text.transformer.resblocks.1.mlp.c_proj.bias', 'text.transformer.resblocks.1.mlp.c_proj.weight', 'text.transformer.resblocks.10.attn.in_proj_bias', 'text.transformer.resblocks.10.attn.in_proj_weight', 'text.trans former.resblocks.10.attn.out_proj.bias', 'text.transformer.resblocks.10.attn.out_proj.weight', 'text.transformer.resblocks.10.ln_1.bias', 'text.transformer.resblocks.10.ln_1.weight', 'text.transformer.resblocks.10.ln_2.bias', 'text.transformer.resblocks.10.ln_2.weight', 'text.transformer.resblocks.10.mlp.c_fc.bias', 'text.transformer.resblocks.10.mlp.c_fc.weight', 'text.transfo rmer.resblocks.10.mlp.c_proj.bias', 'text.transformer.resblocks.10.mlp.c_proj.weight', 
'text.transformer.resblocks.11.attn.in_proj_bias', 'text.transformer.resblocks.11.attn.in_proj_weight', 'text.transformer.resblocks.11.attn.out_proj.bias', 'text.transformer.resblocks.11.attn.out_proj.weight', 'text.transformer.resblocks.11.ln_1.bias', 'text.transformer.resblocks.11.ln_1.weig ht', 'text.transformer.resblocks.11.ln_2.bias', 'text.transformer.resblocks.11.ln_2.weight', 'text.transformer.resblocks.11.mlp.c_fc.bias', 'text.transformer.resblocks.11.mlp.c_fc.weight', 'text.transformer.resblocks.11.mlp.c_proj.bias', 'text.transformer.resblocks.11.mlp.c_proj.weight', 'text.transformer.resblocks.12.attn.in_proj_bias', 'text.transformer.resblocks.12.attn.in_p roj_weight', 'text.transformer.resblocks.12.attn.out_proj.bias', 'text.transformer.resblocks.12.attn.out_proj.weight', 'text.transformer.resblocks.12.ln_1.bias', 'text.transformer.resblocks.12.ln_1.weight', 'text.transformer.resblocks.12.ln_2.bias', 'text.transformer.resblocks.12.ln_2.weight', 'text.transformer.resblocks.12.mlp.c_fc.bias', 'text.transformer.resblocks.12.mlp.c_f c.weight', 'text.transformer.resblocks.12.mlp.c_proj.bias', 'text.transformer.resblocks.12.mlp.c_proj.weight', 'text.transformer.resblocks.13.attn.in_proj_bias', 'text.transformer.resblocks.13.attn.in_proj_weight', 'text.transformer.resblocks.13.attn.out_proj.bias', 'text.transformer.resblocks.13.attn.out_proj.weight', 'text.transformer.resblocks.13.ln_1.bias', 'text.transforme r.resblocks.13.ln_1.weight', 'text.transformer.resblocks.13.ln_2.bias', 'text.transformer.resblocks.13.ln_2.weight', 'text.transformer.resblocks.13.mlp.c_fc.bias', 'text.transformer.resblocks.13.mlp.c_fc.weight', 'text.transformer.resblocks.13.mlp.c_proj.bias', 'text.transformer.resblocks.13.mlp.c_proj.weight', 'text.transformer.resblocks.14.attn.in_proj_bias', 'text.transforme r.resblocks.14.attn.in_proj_weight', 'text.transformer.resblocks.14.attn.out_proj.bias', 'text.transformer.resblocks.14.attn.out_proj.weight', 
'text.transformer.resblocks.14.ln_1.bias', 'text.transformer.resblocks.14.ln_1.weight', 'text.transformer.resblocks.14.ln_2.bias', 'text.transformer.resblocks.14.ln_2.weight', 'text.transformer.resblocks.14.mlp.c_fc.bias', 'text.transfor mer.resblocks.14.mlp.c_fc.weight', 'text.transformer.resblocks.14.mlp.c_proj.bias', 'text.transformer.resblocks.14.mlp.c_proj.weight', 'text.transformer.resblocks.15.attn.in_proj_bias', 'text.transformer.resblocks.15.attn.in_proj_weight', 'text.transformer.resblocks.15.attn.out_proj.bias', 'text.transformer.resblocks.15.attn.out_proj.weight', 'text.transformer.resblocks.15.ln_1 .bias', 'text.transformer.resblocks.15.ln_1.weight', 'text.transformer.resblocks.15.ln_2.bias', 'text.transformer.resblocks.15.ln_2.weight', 'text.transformer.resblocks.15.mlp.c_fc.bias', 'text.transformer.resblocks.15.mlp.c_fc.weight', 'text.transformer.resblocks.15.mlp.c_proj.bias', 'text.transformer.resblocks.15.mlp.c_proj.weight', 'text.transformer.resblocks.16.attn.in_proj _bias', 'text.transformer.resblocks.16.attn.in_proj_weight', 'text.transformer.resblocks.16.attn.out_proj.bias', 'text.transformer.resblocks.16.attn.out_proj.weight', 'text.transformer.resblocks.16.ln_1.bias', 'text.transformer.resblocks.16.ln_1.weight', 'text.transformer.resblocks.16.ln_2.bias', 'text.transformer.resblocks.16.ln_2.weight', 'text.transformer.resblocks.16.mlp.c_ fc.bias', 'text.transformer.resblocks.16.mlp.c_fc.weight', 'text.transformer.resblocks.16.mlp.c_proj.bias', 'text.transformer.resblocks.16.mlp.c_proj.weight', 'text.transformer.resblocks.17.attn.in_proj_bias', 'text.transformer.resblocks.17.attn.in_proj_weight', 'text.transformer.resblocks.17.attn.out_proj.bias', 'text.transformer.resblocks.17.attn.out_proj.weight', 'text.trans former.resblocks.17.ln_1.bias', 'text.transformer.resblocks.17.ln_1.weight', 'text.transformer.resblocks.17.ln_2.bias', 'text.transformer.resblocks.17.ln_2.weight', 'text.transformer.resblocks.17.mlp.c_fc.bias', 
'text.transformer.resblocks.17.mlp.c_fc.weight', 'text.transformer.resblocks.17.mlp.c_proj.bias', 'text.transformer.resblocks.17.mlp.c_proj.weight',
... (the same attn.in_proj_{bias,weight}, attn.out_proj.{bias,weight}, ln_1.{bias,weight}, ln_2.{bias,weight}, mlp.c_fc.{bias,weight}, and mlp.c_proj.{bias,weight} keys, repeated for resblocks 18-23 and 2-9) ...
'text.transformer.resblocks.9.mlp.c_proj.bias', 'text.transformer.resblocks.9.mlp.c_proj.weight']
```
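For context, the `RuntimeError: Unknown model (vitamin_large_224)` in the traceback comes from a timm-style model registry: `create_model` raises when no builder function has been registered under the requested name, e.g. because the module defining the ViTamin architectures was never imported. A minimal illustrative sketch of that pattern (not timm's actual code):

```python
# Minimal sketch of a timm-style model registry (illustration only, NOT timm's
# real implementation). It shows why create_model fails with "Unknown model"
# when an architecture such as vitamin_large_224 has not been registered.
_model_registry = {}

def register_model(fn):
    """Decorator: record a model-builder function under its own name."""
    _model_registry[fn.__name__] = fn
    return fn

def create_model(model_name, **kwargs):
    """Look up and build a registered model; mirror timm's error otherwise."""
    if model_name not in _model_registry:
        raise RuntimeError('Unknown model (%s)' % model_name)
    return _model_registry[model_name](**kwargs)

@register_model
def vitamin_large_224(**kwargs):
    # Stand-in for building a real nn.Module.
    return 'vitamin_large_224 instance'

print(create_model('vitamin_large_224'))  # succeeds: the name was registered
```

Calling `create_model` with a name that was never registered raises the same kind of `RuntimeError` seen above, which is why importing the code that registers the `vitamin_*` architectures (or loading through an interface that does so, such as open_clip here) resolves it.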

At this point, my issue is resolved, and I will use the open_clip interface to load the models. Thank you again,

Best regards,

Beckschen commented 4 months ago

Thanks for the validation. I am glad your issue is resolved. I will look into the Hugging Face checkpoints for the L2 variants.