hi @mzhaoshuai @VamosC, sorry for troubling you. I am new to Transformers and PyTorch and learning fast, but I am having difficulties loading the *.pt checkpoints for inference.
Here is how far I have gotten; I think I am completely misunderstanding how loading works.
Running read.py, I am loading the checkpoint checkpoints_RBU/pretrained/clip/clip4str_large_3c9d881b88.pt and getting:
KeyError: 'visual.layer1.0.conv1.weight'
vit = "visual.proj" in state_dict
evaluates to False
, triggering the conditionalvisual.proj
exists under the parameter namespace clip_model
:
In [49]: [k for k in checkpoint['state_dict'].keys() if "visual.proj" in k]
Out[49]: ['clip_model.visual.proj']
Any advice on how to load the CLIP4STR *.pt checkpoint for inference? You can point me in the right direction with general guidance, and I can work on learning the specific details so you don't have to waste your time.
I fixed the above error and got the model to load and return something.
See the changes here: https://github.com/justinTM/CLIP4STR/commit/937a2eb9bd5f1c1895cdf1023642aaacb4f1751a
Essentially, I remapped the parameter names in the state_dict.
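For reference, a minimal sketch of the kind of remapping I mean (the exact edits are in the commit linked above; the prefix handling here is just an illustration):

import torch

# Load the CLIP4STR checkpoint and pull out its state_dict.
ckpt = torch.load("pretrained/clip/clip4str_large_3c9d881b88.pt", map_location="cpu")
state_dict = ckpt["state_dict"]

# Keys are namespaced as "clip_model.visual.proj" etc., so strip the leading
# "clip_model." prefix to get the names the loader checks for.
remapped = {
    (k[len("clip_model."):] if k.startswith("clip_model.") else k): v
    for k, v in state_dict.items()
}
print("visual.proj" in remapped)  # now True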
But now the inference merely returns padding characters [P]:
> python code/read.py checkpoints/pretrained/clip/clip4str_large_3c9d881b88.pt --images_path=./images/ --device=mps
Additional keyword arguments: {}
config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False
>>> Try to load CLIP model from checkpoints/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False
loading checkpoint from checkpoints/pretrained/clip/clip4str_large_3c9d881b88.pt
The dimension of the visual decoder is 768.
/Users/me/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
warnings.warn(
MCK1-0PF_2-EXIT_24-04-19_14-20-22.87.jpg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
Untitled 13.jpg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
@justinTM Hi, can you provide a more detailed error log?
Did you set https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/models/vl_str/system.py#L22 properly? It is currently set to my path, and you should change it to your own path.
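For example, that line should look something like this (the path below is just an illustration; use your own absolute path to the directory holding the CLIP weights):

CLIP_PATH = '/path/to/your/CLIP4STR/pretrained/clip'  # in strhub/models/vl_str/system.py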
The clip_pretrained in all *.yaml config files should be the original CLIP pre-trained model, rather than the provided CLIP4STR STR models.
By the way, it should work without modifying the state_dict.
Let's first try to make it work with code in this repo.
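If it helps, a small sanity check along these lines can confirm the config points at the original CLIP weights (the nesting of clip_pretrained under model: below is an assumption; adjust it to the actual layout of vl4str-large.yaml):

import os
import yaml

cfg_path = "configs/experiment/vl4str-large.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# The exact nesting may differ; look for the clip_pretrained entry.
clip_pretrained = cfg.get("model", {}).get("clip_pretrained", cfg.get("clip_pretrained"))
print("clip_pretrained ->", clip_pretrained)

# It should be the original CLIP/OpenCLIP weight, not a clip4str_* checkpoint.
assert clip_pretrained and os.path.isfile(clip_pretrained)
assert "clip4str" not in os.path.basename(clip_pretrained).lower()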
hi @mzhaoshuai, thank you for the reply!
I'll put as much info as I can here.
I reverted all code to the origin main commit here.
This is my directory structure:
❯ tree . -L 2
.
├── code
│ ├── LICENSE
│ ├── README.md
│ ├── bench.py
│ ├── configs
│ ├── hubconf.py
│ ├── misc
│ ├── read.py
│ ├── requirements.txt
│ ├── scripts
│ ├── strhub
│ ├── test.py
│ ├── tools
│ ├── train.py
│ └── tune.py
├── dataset
├── images
│ ├── MCK1-0PF_2-EXIT_24-04-19_14-20-22.87.jpg
│ └── Untitled 13.jpg
├── output
├── pretrained
│ └── clip
│ ├── OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
│ ├── README.md
│ ├── ViT-L-14.pt
│ ├── clip4str_base_6e9fe947ac.pt
│ ├── clip4str_base_6e9fe947ac_log.txt
│ ├── clip4str_huge_3e942729b1.pt
│ ├── clip4str_huge_3e942729b1_log.txt
│ ├── clip4str_huge_5eef9f86e2.pt
│ ├── clip4str_huge_5eef9f86e2_log.txt
│ ├── clip4str_large_3c9d881b88.pt
│ └── clip4str_large_3c9d881b88_log.txt
└── state_dict.txt
17 directories, 16 files
I set CLIP_PATH = '/Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip' in strhub/models/vl_str/system.py.
I downloaded OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin, renamed it properly, and placed it in ABS_ROOT/pretrained/clip.
I set clip_pretrained to /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin in https://github.com/justinTM/CLIP4STR/blob/main/configs/experiment/vl4str-large.yaml#L16
Here is the full output of the command:
❯ python code/read.py pretrained/clip/clip4str_large_3c9d881b88.pt --images_path code/misc/test_image --device=mps
Additional keyword arguments: {}
config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False
>>> Try to load CLIP model from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
loading checkpoint from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
The dimension of the visual decoder is 768.
/Users/justin/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
warnings.warn(
image_1576.jpeg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
I also tried the original base16 checkpoint:
ViT-B-16.pt
clip4str_base16x16_d70bde1f2d.ckpt
clip_pretrained: /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/ViT-B-16.pt
then ran the command:
❯ python code/read.py pretrained/clip/clip4str_base16x16_d70bde1f2d.ckpt --images_path code/misc/test_image --device=mps
Additional keyword arguments: {}
config of VL4STR:
image_freeze_nlayer: 0, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False
>>> Try to load CLIP model from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/ViT-B-16.pt
loading checkpoint from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/ViT-B-16.pt
The dimension of the visual decoder is 512.
/Users/justin/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
warnings.warn(
image_1576.jpeg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
@justinTM So, you run the code normally, but the output of the model for your test images is just padding tokens?
May I see your test images?
I ran the code again:
bash read.sh 4 clip4str_base_6e9fe947ac.pt ~/code/CLIP4STR/misc/test_image
Additional keyword arguments: {}
config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False
loading checkpoint from /home/shuai/pretrained/clip/OpenCLIP-ViT-B-16-DataComp-XL-s13B-b90K.bin
The dimension of the visual decoder is 512.
image_1576.jpeg: Chicken
Do you get the Chicken result for the provided test image?
If so, I think it is just that CLIP4STR does not work for your images.
hi @mzhaoshuai yes, I am using the built-in test image and getting only padding characters as output.
Is the below the correct procedure?
- download OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin to pretrained/clip/
- set clip_pretrained to the above filepath in configs/experiment/vl4str-large.yaml#L16
Sorry for the late and discontinuous replies. I have just been flying for roughly 20 hours...
Back to the problem: it all looks good, except for the results.
Why did you set --device=mps? Are you running CLIP4STR on a macOS machine?
I have never tested that. It should work, but I am not sure.
hi @mzhaoshuai no problem at all, thank you for helping whenever you are able!
Yes, I am on macOS. AHA! Yes, setting --device=cpu fixed it :)
Initially, without any device flag, it errored due to no CUDA, so I set it to macOS Metal (mps). I should have tried CPU!
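In case it helps other macOS users, a small fallback helper one could use instead of hard-coding the flag (illustrative, not part of read.py):

import torch

def pick_device(requested=None):
    # Use an explicitly requested device if given, otherwise prefer CUDA and
    # fall back to CPU. MPS is skipped here because, as above, it silently
    # produced only padding tokens for this model.
    if requested is not None:
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print(pick_device())  # e.g. device(type='cpu') on a Mac without CUDA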
For background, we were accepted into the NVIDIA Inception program (i.e., AWS credits), so I will get CUDA-enabled EC2 instances soon; maybe it will work there. Currently I am using AWS Rekognition for scene text recognition, but I cannot find any papers or benchmarks for it and would like to explore alternatives, like your state-of-the-art model here.
Nice work on the ghost sentences, by the way; it is interesting research.
hi guys, awesome work, thanks for implementing and providing the models.
I noticed the check for base32x32 in the key, but the base checkpoint clip4str_base_6e9fe947ac.pt doesn't satisfy the naming requirement if passed as a filepath: https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/models/utils.py#L68
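A quick illustration of the mismatch, assuming the linked line is a plain substring test on the checkpoint name:

checkpoint = "pretrained/clip/clip4str_base_6e9fe947ac.pt"
print("base32x32" in checkpoint)  # False, so that branch is never hit for the released base checkpoint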