ChenDelong1999 / RemoteCLIP

🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
https://arxiv.org/abs/2306.11029
Apache License 2.0
contrastive-language-image-pretraining remote-sensing vision-language
## [RemoteCLIP🛰️: A Vision Language Foundation Model for Remote Sensing](https://arxiv.org/abs/2306.11029)

[Fan Liu (刘凡)](https://multimodality.group/author/%E5%88%98%E5%87%A1/)✉ \*, [Delong Chen (陈德龙)](https://chendelong.world/)✉ \*, [Zhangqingyun Guan (管张青云)](https://github.com/gzqy1026), [Xiaocong Zhou (周晓聪)](https://multimodality.group/author/%E5%91%A8%E6%99%93%E8%81%AA/), [Jiale Zhu (朱佳乐)](https://multimodality.group/author/%E6%9C%B1%E4%BD%B3%E4%B9%90/), [Qiaolin Ye (业巧林)](https://it.njfu.edu.cn/szdw/20181224/i14059.html), Liyong Fu (符利勇), [Jun Zhou (周峻)](https://experts.griffith.edu.au/7205-jun-zhou)

\* *Equal Contribution*

News

Introduction

Welcome to the official repository of our paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing"!

General-purpose foundation models have become increasingly important in the field of artificial intelligence. While self-supervised learning (SSL) and Masked Image Modeling (MIM) have led to promising results in building such foundation models for remote sensing, these models primarily learn low-level features, require annotated data for fine-tuning, and are not applicable for retrieval and zero-shot applications due to the lack of language understanding.

In response to these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics, as well as aligned text embeddings for seamless downstream application. To address the scarcity of pretraining data, we leverage data scaling: we convert heterogeneous annotations into a unified image-caption format using Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion, and further incorporate UAV imagery, resulting in a 12× larger pretraining dataset.

RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting. Evaluations on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, show that RemoteCLIP consistently outperforms baseline foundation models across different model scales.

Impressively, RemoteCLIP outperforms the previous SoTA by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets.

Load RemoteCLIP

RemoteCLIP is trained with the ITRA codebase. We have converted the pretrained checkpoints to an OpenCLIP-compatible format and uploaded them to [this Huggingface Repo] for more convenient access.
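
As a rough illustration of what loading an OpenCLIP-compatible checkpoint can look like, here is a minimal sketch. It assumes that `open_clip` and `torch` are installed, that the checkpoint file holds a plain state dict, and that the checkpoint has already been downloaded locally. The helper name `load_remoteclip`, the default architecture `"ViT-B-32"`, and the file path in the usage comment are illustrative assumptions, not part of this repository.

```python
def load_remoteclip(checkpoint_path, model_name="ViT-B-32", device="cpu"):
    """Build an OpenCLIP model and load converted weights into it.

    Hypothetical helper: `checkpoint_path` must point to a locally
    downloaded checkpoint whose architecture matches `model_name`.
    """
    # Imported here so the heavy dependencies are only needed when called.
    import torch
    import open_clip

    # Returns (model, train_preprocess, val_preprocess); the last one is
    # the transform to use at inference time.
    model, _, preprocess = open_clip.create_model_and_transforms(model_name)
    tokenizer = open_clip.get_tokenizer(model_name)

    # Assumption: the file stores a plain state_dict (no wrapper keys).
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)

    return model.to(device).eval(), preprocess, tokenizer


# Example call (placeholder path, adjust to your local checkpoint):
# model, preprocess, tokenizer = load_remoteclip("path/to/checkpoint.pt")
```

If `load_state_dict` reports missing or unexpected keys, the architecture name most likely does not match the checkpoint.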

Retrieval Evaluation

To perform cross-modal retrieval with RemoteCLIP, we extract image and text representations on the test split, apply L2 normalization, and retrieve the most similar samples based on dot-product similarity. We report recall at top-1 (R@1), top-5 (R@5), and top-10 (R@10), as well as the mean recall over these values.
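
The metric described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the repository's `retrieval.py`: it assumes each query has exactly one ground-truth match at the same index (real caption datasets often pair several captions with one image), and it averages recall over both retrieval directions, which is an assumption about how the mean is taken. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Recall@k in percent, assuming query i matches gallery item i."""
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_emb, dtype=np.float64))
    sim = q @ g.T  # dot product of unit vectors
    # Rank of the true match = number of gallery items scoring strictly higher.
    true_scores = np.diag(sim)[:, None]
    ranks = (sim > true_scores).sum(axis=1)
    return {k: 100.0 * float(np.mean(ranks < k)) for k in ks}


def retrieval_report(image_emb, text_emb, ks=(1, 5, 10)):
    """Recall in both directions plus the mean over all reported values."""
    i2t = recall_at_k(image_emb, text_emb, ks)
    t2i = recall_at_k(text_emb, image_emb, ks)
    values = list(i2t.values()) + list(t2i.values())
    return {
        "image_to_text": i2t,
        "text_to_image": t2i,
        "mean_recall": float(np.mean(values)),
    }
```

Because the embeddings are normalized first, the dot product equals cosine similarity, so the ranking does not depend on the raw embedding magnitudes.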

We have prepared a retrieval.py script to replicate the retrieval evaluation. Follow the steps below to evaluate the retrieval performance of RemoteCLIP on the RSITMD, RSICD, and UCM datasets:

Acknowledgments

Citation

If you find this work useful, please cite our paper as:

@article{remoteclip,
  author       = {Fan Liu and
                  Delong Chen and
                  Zhangqingyun Guan and
                  Xiaocong Zhou and
                  Jiale Zhu and
                  Qiaolin Ye and
                  Liyong Fu and
                  Jun Zhou},
  title        = {RemoteCLIP: {A} Vision Language Foundation Model for Remote Sensing},
  journal      = {{IEEE} Transactions on Geoscience and Remote Sensing},
  volume       = {62},
  pages        = {1--16},
  year         = {2024},
  url          = {https://doi.org/10.1109/TGRS.2024.3390838},
  doi          = {10.1109/TGRS.2024.3390838},
}