Reproducing results using CLIP weights

chenshunpeng commented 1 month ago

Hello, this is very valuable work that opens the door to solving visual geo-localization with multimodal models. However, I ran into some issues when running inference with the weights provided on your Baidu Netdisk. First, with the CLIP-ResNet50 and CLIP-ResNet101 weights, I could not reproduce the pitts30k results reported in the paper for ProGeo(CNN). Second, inference with the CLIP-ViT-B-16 and CLIP-ViT-B-32 weights fails with an error. Are some of my parameters incorrect, or do I need to change additional parameters in the parser? I would appreciate your help. Thank you very much.

Results of ProGeo(CNN) from the paper: `R@1: 91.8, R@5: 97.4`

Parameters of RN101_best_model: --backbone CLIP-RN101 --resume_model /work/ccc/project/VPR2/ProGEO/model/RN101_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 512

Complete parameters:

2024-09-02 14:31:00   eval.py --backbone CLIP-RN101 --resume_model /work/ccc/project/VPR2/ProGEO/model/RN101_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 512
2024-09-02 14:31:00   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='CLIP-RN101', batch_size=32, batch_size_stage1=512, brightness=0.7, cache_feature_folder=None, checkpoint_period_stage1=80, checkpoint_period_stage2=8, classifiers_lr=0.01, contrast=0.7, device='cuda', epochs_num=64, epochs_num_stage1=480, fc_output_dim=512, freeze_cnn=2, freeze_trans=6, groups_num=8, hue=0.5, image_size=512, infer_batch_size=64, iterations_per_epoch=10000, lr=1e-05, lr_stage1=0.01, min_images_per_class=10, num_preds_to_save=0, num_workers=8, output_folder='logs/default/2024-09-02_14-31-00', positive_dist_threshold=25, prompt_learners=None, random_resized_crop=0.5, resize_test_imgs=False, resume_model='/work/ccc/project/VPR2/ProGEO/model/RN101_best_model', resume_model_stage1=None, resume_train=None, saturation=0.7, save_dir='default', save_only_wrong_preds=False, seed=0, soft_triplet=False, test_set_folder='/work/ccc/datasets/pitts30k/images/test', train_all_layers=False, use_amp16=False)
2024-09-02 14:31:00   The outputs are being saved in logs/default/2024-09-02_14-31-00
2024-09-02 14:31:03   There are 1 GPUs and 64 CPUs.
2024-09-02 14:31:03   Loading model from /work/ccc/project/VPR2/ProGEO/model/RN101_best_model

Reproduced results: 2024-09-02 14:33:37 < test - #q: 6816; #db: 10000 >: R@1: 90.8, R@5: 96.0, R@10: 96.8, R@20: 97.5

Parameters of RN50_best_model: --backbone CLIP-RN50 --resume_model /work/ccc/project/VPR2/ProGEO/model/RN50_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 1024

Complete parameters:

2024-09-02 14:54:20   eval.py --backbone CLIP-RN50 --resume_model /work/ccc/project/VPR2/ProGEO/model/RN50_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 1024
2024-09-02 14:54:20   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='CLIP-RN50', batch_size=32, batch_size_stage1=512, brightness=0.7, cache_feature_folder=None, checkpoint_period_stage1=80, checkpoint_period_stage2=8, classifiers_lr=0.01, contrast=0.7, device='cuda', epochs_num=64, epochs_num_stage1=480, fc_output_dim=1024, freeze_cnn=2, freeze_trans=6, groups_num=8, hue=0.5, image_size=512, infer_batch_size=64, iterations_per_epoch=10000, lr=1e-05, lr_stage1=0.01, min_images_per_class=10, num_preds_to_save=0, num_workers=8, output_folder='logs/default/2024-09-02_14-54-20', positive_dist_threshold=25, prompt_learners=None, random_resized_crop=0.5, resize_test_imgs=False, resume_model='/work/ccc/project/VPR2/ProGEO/model/RN50_best_model', resume_model_stage1=None, resume_train=None, saturation=0.7, save_dir='default', save_only_wrong_preds=False, seed=0, soft_triplet=False, test_set_folder='/work/ccc/datasets/pitts30k/images/test', train_all_layers=False, use_amp16=False)
2024-09-02 14:54:20   The outputs are being saved in logs/default/2024-09-02_14-54-20
2024-09-02 14:54:22   There are 1 GPUs and 64 CPUs.
2024-09-02 14:54:22   Loading model from /work/ccc/project/VPR2/ProGEO/model/RN50_best_model

Reproduced results: 2024-09-02 14:58:55 < test - #q: 6816; #db: 10000 >: R@1: 90.0, R@5: 95.4, R@10: 96.4, R@20: 97.2

Results of ProGeo(Transformer) from the paper: `R@1: 93.0, R@5: 98.3`

Parameters of VIT32_best_model: --backbone CLIP-ViT-B-32 --resume_model /work/ccc/project/VPR2/ProGEO/model/VIT32_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test

Complete parameters:

2024-09-01 21:25:15   eval.py --backbone CLIP-ViT-B-32 --resume_model /work/ccc/project/VPR2/ProGEO/model/VIT32_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test
2024-09-01 21:25:15   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='CLIP-ViT-B-32', batch_size=32, batch_size_stage1=512, brightness=0.7, cache_feature_folder=None, checkpoint_period_stage1=80, checkpoint_period_stage2=8, classifiers_lr=0.01, contrast=0.7, device='cuda', epochs_num=64, epochs_num_stage1=480, fc_output_dim=1024, freeze_cnn=2, freeze_trans=6, groups_num=8, hue=0.5, image_size=512, infer_batch_size=64, iterations_per_epoch=10000, lr=1e-05, lr_stage1=0.01, min_images_per_class=10, num_preds_to_save=0, num_workers=8, output_folder='logs/default/2024-09-01_21-25-15', positive_dist_threshold=25, prompt_learners=None, random_resized_crop=0.5, resize_test_imgs=False, resume_model='/work/ccc/project/VPR2/ProGEO/model/VIT32_best_model', resume_model_stage1=None, resume_train=None, saturation=0.7, save_dir='default', save_only_wrong_preds=False, seed=0, soft_triplet=False, test_set_folder='/work/ccc/datasets/pitts30k/images/test', train_all_layers=False, use_amp16=False)
2024-09-01 21:25:15   The outputs are being saved in logs/default/2024-09-01_21-25-15
2024-09-01 21:25:18   There are 1 GPUs and 64 CPUs.
2024-09-01 21:25:18   Loading model from /work/ccc/project/VPR2/ProGEO/model/VIT32_best_model

Error message:

  0%|                                                                       | 0/157 [00:00<?, ?it/s]
  0%|                                                                       | 0/157 [00:08<?, ?it/s]
2024-09-01 21:25:41   
Traceback (most recent call last):
  File "eval.py", line 46, in <module>
    recalls, recalls_str = test.test(args, test_ds, model, args.num_preds_to_save)
  File "/work/ccc/project/VPR2/ProGEO/test.py", line 31, in test
    descriptors = model(images.to(args.device))
  File "/work/ccc/anaconda3/envs/ProGEO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/ccc/project/VPR2/ProGEO/cosplace_model/cosplace_network_stage2.py", line 61, in forward
    image_features = self.backbone(x)
  File "/work/ccc/anaconda3/envs/ProGEO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/ccc/project/VPR2/ProGEO/clip/model.py", line 226, in forward
    x = x + self.positional_embedding.to(x.dtype)
RuntimeError: The size of tensor a (301) must match the size of tensor b (257) at non-singleton dimension 1

2024-09-01 21:25:41   Experiment finished (with some errors)

Parameters of ViT16_best_model: --backbone CLIP-ViT-B-16 --resume_model /work/ccc/project/VPR2/ProGEO/model/ViT16_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 512

Complete parameters:

2024-09-02 14:45:57   eval.py --backbone CLIP-ViT-B-16 --resume_model /work/ccc/project/VPR2/ProGEO/model/ViT16_best_model --test_set_folder /work/ccc/datasets/pitts30k/images/test --fc_output_dim 512
2024-09-02 14:45:57   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='CLIP-ViT-B-16', batch_size=32, batch_size_stage1=512, brightness=0.7, cache_feature_folder=None, checkpoint_period_stage1=80, checkpoint_period_stage2=8, classifiers_lr=0.01, contrast=0.7, device='cuda', epochs_num=64, epochs_num_stage1=480, fc_output_dim=512, freeze_cnn=2, freeze_trans=6, groups_num=8, hue=0.5, image_size=512, infer_batch_size=64, iterations_per_epoch=10000, lr=1e-05, lr_stage1=0.01, min_images_per_class=10, num_preds_to_save=0, num_workers=8, output_folder='logs/default/2024-09-02_14-45-57', positive_dist_threshold=25, prompt_learners=None, random_resized_crop=0.5, resize_test_imgs=False, resume_model='/work/ccc/project/VPR2/ProGEO/model/ViT16_best_model', resume_model_stage1=None, resume_train=None, saturation=0.7, save_dir='default', save_only_wrong_preds=False, seed=0, soft_triplet=False, test_set_folder='/work/ccc/datasets/pitts30k/images/test', train_all_layers=False, use_amp16=False)
2024-09-02 14:45:57   The outputs are being saved in logs/default/2024-09-02_14-45-57
2024-09-02 14:45:59   There are 1 GPUs and 64 CPUs.
2024-09-02 14:45:59   Loading model from /work/ccc/project/VPR2/ProGEO/model/ViT16_best_model

Error message:

  0%|                                                                       | 0/157 [00:00<?, ?it/s]
  0%|                                                                       | 0/157 [00:01<?, ?it/s]
2024-09-02 14:46:02   
Traceback (most recent call last):
  File "eval.py", line 46, in <module>
    recalls, recalls_str = test.test(args, test_ds, model, args.num_preds_to_save)
  File "/work/ccc/project/VPR2/ProGEO/test.py", line 31, in test
    descriptors = model(images.to(args.device))
  File "/work/ccc/anaconda3/envs/ProGEO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/ccc/project/VPR2/ProGEO/cosplace_model/cosplace_network_stage2.py", line 59, in forward
    image_features = self.backbone(x)
  File "/work/ccc/anaconda3/envs/ProGEO/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/ccc/project/VPR2/ProGEO/clip/model.py", line 226, in forward
    x = x + self.positional_embedding.to(x.dtype)
RuntimeError: The size of tensor a (1201) must match the size of tensor b (1025) at non-singleton dimension 1

2024-09-02 14:46:02   Experiment finished (with some errors)

Chain-Mao commented 1 month ago

Hello, thanks for visiting my code and commenting.

I'm sure that the results are reproducible, but I'm not sure whether my assistant put the best model on the netdisk. You can train the model again if you are interested. I didn't spend much time tuning the parameters during the experiments, so you may get a better result.
If you are using ViT as the backbone, you need to add the parameter "--resize_test_imgs" to the command, because ViT only accepts images of a fixed size. For example: "python3 eval.py --backbone CLIP-ViT-B-16 --resume_model /data1/CosPlace/logs/default/stage2/VIT16_nofreeze/best_model.pth --test_set_folder /data3/VPR-datasets-downloader/msls/val --resize_test_imgs --infer_batch_size 128 --fc_output_dim 512". If you have any other problems, please let me know.
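
The two size mismatches in the tracebacks are consistent with simple patch arithmetic. Below is a minimal sketch, assuming the test images are 640x480 (not stated in this thread, but it matches the reported token counts) and that the positional embedding was sized for the 512x512 `image_size` shown in the logged arguments. `vit_token_count` is a hypothetical helper for illustration, not a function from the repository.

```python
def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Tokens seen by a ViT: one per non-overlapping patch, plus the class token."""
    return (height // patch_size) * (width // patch_size) + 1


# Positional embedding length for a 512x512 input (image_size=512 in the logs).
expected_b32 = vit_token_count(512, 512, 32)   # 16 * 16 + 1
expected_b16 = vit_token_count(512, 512, 16)   # 32 * 32 + 1

# Token count for an unresized image, ASSUMED here to be 480 high and 640 wide.
actual_b32 = vit_token_count(480, 640, 32)     # 15 * 20 + 1
actual_b16 = vit_token_count(480, 640, 16)     # 30 * 40 + 1

print(f"ViT-B/32: got {actual_b32} tokens, embedding has {expected_b32}")
print(f"ViT-B/16: got {actual_b16} tokens, embedding has {expected_b16}")
```

Under these assumptions the numbers match "tensor a" and "tensor b" in both error messages, which suggests that resizing the test images to the training resolution is what makes the shapes agree.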

chenshunpeng commented 1 month ago

Sure, based on your suggestion, I performed inference on the pitts30k dataset using the CLIP-ViT-B-16 and CLIP-ViT-B-32 weights, with the results as follows:

VIT32_best_model: < test - #q: 6816; #db: 10000 >: R@1: 89.1, R@5: 95.1, R@10: 96.5, R@20: 97.5
ViT16_best_model: < test - #q: 6816; #db: 10000 >: R@1: 90.6, R@5: 96.1, R@10: 97.3, R@20: 98.3

If there are any new weights available, I hope you can update them promptly as it would be beneficial to the community. I will also attempt to reproduce the results myself. Thank you for your response.
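
For reference, the remaining R@1 gaps on pitts30k can be tabulated from the numbers quoted in this thread. This is only a sketch: the thread does not say which backbone the paper's ProGeo(CNN) and ProGeo(Transformer) rows correspond to, so comparing both checkpoints of each family against the same paper figure is an assumption.

```python
# R@1 values copied from this thread; the paper-to-checkpoint pairing is assumed.
paper_r1 = {"CNN": 91.8, "Transformer": 93.0}
reproduced_r1 = {
    "RN101_best_model": ("CNN", 90.8),
    "RN50_best_model": ("CNN", 90.0),
    "VIT32_best_model": ("Transformer", 89.1),
    "ViT16_best_model": ("Transformer", 90.6),
}

# Gap in percentage points between the paper's R@1 and each reproduced R@1.
gaps = {
    name: round(paper_r1[family] - r1, 1)
    for name, (family, r1) in reproduced_r1.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points below the paper's R@1")
```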

Chain-Mao commented 1 month ago

All right, I will check the weights when I have spare time.

Chain-Mao / ProGEO