VamosC / CLIP4STR

An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".
Apache License 2.0

bad filename format for base checkpoint pre-trained on DC-1B #13

Closed: justinTM closed this issue 2 months ago

justinTM commented 2 months ago

hi guys, awesome work! Thanks for the implementation and for providing the models.

I noticed the check for base32x32 in the key, but the base checkpoint clip4str_base_6e9fe947ac.pt doesn't satisfy that naming requirement when passed as a filepath:

https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/models/utils.py#L68
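To illustrate the failure mode (a simplified, hypothetical sketch, not the literal code in strhub/models/utils.py): the filename clip4str_base_6e9fe947ac.pt carries no size tag, so a substring check of this kind never matches when the file is passed as a plain path.

```python
# Hypothetical simplification of the name-based dispatch around utils.py#L68.
# It only illustrates why clip4str_base_6e9fe947ac.pt fails the substring check;
# the real function has more branches and different return values.
def infer_variant(checkpoint_path: str) -> str:
    name = checkpoint_path.lower()
    if 'base32x32' in name:
        return 'base-32x32'
    if 'base16x16' in name:
        return 'base-16x16'
    raise ValueError(f'cannot infer the model variant from: {checkpoint_path}')

# infer_variant('clip4str_base_6e9fe947ac.pt') raises ValueError
```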

justinTM commented 2 months ago

hi @mzhaoshuai @VamosC, sorry for troubling you. I am new to Transformers and PyTorch and learning fast, but I am having difficulties loading the *.pt checkpoints for inference.

Here is how far I have gotten. I think I am completely misunderstanding how loading works.
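Concretely, this is roughly what I am trying, assuming this fork keeps the PARSeq-style strhub interface (load_from_checkpoint, SceneTextDataModule.get_transform, tokenizer.decode); the checkpoint and image paths are placeholders for my local files, so treat this as a sketch rather than the intended usage:

```python
import torch
from PIL import Image

from strhub.data.module import SceneTextDataModule
from strhub.models.utils import load_from_checkpoint

# Sketch only: assumes the PARSeq-style interface; paths are placeholders.
ckpt_path = 'pretrained/clip/clip4str_base_6e9fe947ac.pt'
model = load_from_checkpoint(ckpt_path).eval().to('cpu')
img_transform = SceneTextDataModule.get_transform(model.hparams.img_size)

image = Image.open('misc/test_image/image_1576.jpeg').convert('RGB')
batch = img_transform(image).unsqueeze(0)

with torch.no_grad():
    probs = model(batch).softmax(-1)          # character probabilities per position
    labels, confidences = model.tokenizer.decode(probs)
print(labels[0])
```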

Any advice on how to load the CLIP4STR *.pt checkpoint for inference? You can point me in the right direction with general guidance and I can work on learning the specific details, so you don't have to waste your time.

justinTM commented 2 months ago

I fixed the above error and got the model to load and return something.

see changes here: https://github.com/justinTM/CLIP4STR/commit/937a2eb9bd5f1c1895cdf1023642aaacb4f1751a

I changed the parameters around in the state_dict, but now the inference merely returns padding characters [P]:

> python code/read.py checkpoints/pretrained/clip/clip4str_large_3c9d881b88.pt --images_path=./images/ --device=mps
Additional keyword arguments: {}

 config of VL4STR: 
         image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False 
         use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0 
         use_share_dim: True, image_detach: True, clip_cls_eot_feature: False 
         cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False 

>>> Try to load CLIP model from checkpoints/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin

 config of VL4STR: 
         image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False 
         use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0 
         use_share_dim: True, image_detach: True, clip_cls_eot_feature: False 
         cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False 

loading checkpoint from checkpoints/pretrained/clip/clip4str_large_3c9d881b88.pt
The dimension of the visual decoder is 768.
/Users/me/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
  warnings.warn(
MCK1-0PF_2-EXIT_24-04-19_14-20-22.87.jpg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
Untitled 13.jpg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
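For reference, the state_dict change in the commit above is roughly of this shape (a hypothetical sketch; the actual keys in the checkpoint may differ):

```python
import torch

# Hypothetical sketch of the kind of key remapping I tried; the real prefix/keys
# may differ, so treat 'model.' below as an example only.
ckpt = torch.load('pretrained/clip/clip4str_large_3c9d881b88.pt', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)
state_dict = {k.removeprefix('model.'): v for k, v in state_dict.items()}
```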
mzhaoshuai commented 2 months ago

@justinTM Hi, can you provide a more detailed error log?

Did you set https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/models/vl_str/system.py#L22 properly? It is currently set to my path; you should change it to your own path.

The clip_pretrained entry in all *.yaml config files should point to the original CLIP pre-trained model, rather than to the provided CLIP4STR STR models.

By the way, it should work without modifying the state_dict. Let's first try to make it work with the code in this repo.
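Concretely, the two places look roughly like this (an illustrative sketch with placeholder paths, not the literal source):

```python
# strhub/models/vl_str/system.py (around the line linked above): replace my
# hard-coded default with the location of your downloaded CLIP weights, e.g.
clip_pretrained = '/your/path/pretrained/clip/OpenCLIP-ViT-B-16-DataComp-XL-s13B-b90K.bin'

# Likewise, the clip_pretrained entry in configs/experiment/*.yaml should point
# to the original CLIP weights (such as the .bin above), not to a clip4str_*.pt model.
```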

justinTM commented 2 months ago

hi @mzhaoshuai, thank you for the reply!

I'll try to put as much info as I can here.

Here is the full output of the command:

❯ python code/read.py pretrained/clip/clip4str_large_3c9d881b88.pt --images_path code/misc/test_image --device=mps
Additional keyword arguments: {}

 config of VL4STR: 
         image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False 
         use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0 
         use_share_dim: True, image_detach: True, clip_cls_eot_feature: False 
         cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False 

>>> Try to load CLIP model from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
loading checkpoint from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin
The dimension of the visual decoder is 768.
/Users/justin/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
  warnings.warn(
image_1576.jpeg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
justinTM commented 2 months ago

I also tried the original base16 checkpoint:

I ran this command:

❯ python code/read.py pretrained/clip/clip4str_base16x16_d70bde1f2d.ckpt --images_path code/misc/test_image --device=mps
Additional keyword arguments: {}

 config of VL4STR: 
         image_freeze_nlayer: 0, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False 
         use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0 
         use_share_dim: True, image_detach: True, clip_cls_eot_feature: False 
         cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False 

>>> Try to load CLIP model from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/ViT-B-16.pt
loading checkpoint from /Users/justin/git/hf/spaces/justinTM/CLIP4STR/pretrained/clip/ViT-B-16.pt
The dimension of the visual decoder is 512.
/Users/justin/git/hf/spaces/justinTM/CLIP4STR/.devbox/virtenv/python/.venv/lib/python3.10/site-packages/torch/nn/functional.py:5137: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
  warnings.warn(
image_1576.jpeg: [P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P][P]
mzhaoshuai commented 2 months ago

@justinTM So, you run the code normally, but the model's output for your test images is just padding tokens?

May I see your test images?

I ran the code again:

bash read.sh 4 clip4str_base_6e9fe947ac.pt ~/code/CLIP4STR/misc/test_image
Additional keyword arguments: {}

 config of VL4STR:
         image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
         use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
         use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
         cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

loading checkpoint from /home/shuai/pretrained/clip/OpenCLIP-ViT-B-16-DataComp-XL-s13B-b90K.bin
The dimension of the visual decoder is 512.
image_1576.jpeg: Chicken

Do you get the Chicken result for the provided test image? If so, I think it is simply that CLIP4STR does not work on your own images.

justinTM commented 2 months ago

hi @mzhaoshuai, yes I am using the built-in test image, and getting only padding characters as output.

Is the below the correct procedure?

  1. download OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin to pretrain/clip/ (see the sketch after this list)
  2. set clip_pretrained to above filepath in configs/experiment/vl4str-large.yaml#L16
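As an aside, step 1 can be scripted roughly like this; the Hugging Face repo id and filename are my assumption of where the DataComp-XL ViT-L/14 weights live, so adjust if the actual source differs:

```python
import shutil
from huggingface_hub import hf_hub_download

# Assumed source of the OpenCLIP ViT-L/14 DataComp-XL weights; the downloaded file
# is then copied to the filename the CLIP4STR configs expect.
src = hf_hub_download(
    repo_id='laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K',
    filename='open_clip_pytorch_model.bin',
)
shutil.copy(src, 'pretrain/clip/OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin')
```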
mzhaoshuai commented 2 months ago

> hi @mzhaoshuai, yes I am using the built-in test image, and getting only padding characters as output.
>
> Is the below the correct procedure?
>
>   1. download OpenCLIP-ViT-L-14-DataComp-XL-s13B-b90K.bin to pretrain/clip/
>   2. set clip_pretrained to above filepath in configs/experiment/vl4str-large.yaml#L16

Sorry for the late and discontinuous replies. I was just on a plane for roughly 20 hours....

Back to the problem: it all looks good, except for the results.

Why did you set --device=mps? Do you run CLIP4STR on a macOS machine? I have never tested that setup. It should work, but I am not sure.

justinTM commented 2 months ago

hi @mzhaoshuai, no problem at all, thank you for helping whenever you are able!

Yes, I am on macOS. AHA! Setting --device=cpu fixed it :)

Initially, without any device flag, it errored because there is no CUDA, so I set it to macOS Metal (mps). I should have tried CPU!
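For anyone else hitting this on Apple silicon, a simple fallback like the one below (generic PyTorch, not code from this repo) avoids assuming CUDA; in my case MPS produced only padding tokens, so CPU is the safer default on macOS until the MPS path is verified:

```python
import torch

# Generic device fallback: prefer CUDA, then Apple's MPS, then CPU.
def pick_device() -> str:
    if torch.cuda.is_available():
        return 'cuda'
    if torch.backends.mps.is_available():
        return 'mps'
    return 'cpu'

print(pick_device())  # e.g. pass this as --device
```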

For background, we have been accepted into the NVIDIA Inception program (which comes with AWS credits), so I will have CUDA-enabled EC2 instances soon; maybe it will work there. Currently I am using AWS Rekognition for scene text recognition, but I cannot find any papers or benchmarks for it and would like to explore alternatives, like your state-of-the-art model here.

Nice work on the ghost sentences, by the way; it is interesting research.