Large-scale language-image pre-trained models (e.g., CLIP) have shown superior performance on many cross-modal retrieval tasks. However, transferring the knowledge learned by such models to video-based person re-identification (ReID) has barely been explored. In addition, current ReID benchmarks lack decent text descriptions. To address these issues, in this work we propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
To the best of our knowledge, we are the first to extract identity-specific sequence features to replace the text features of CLIP, storing them in a CLIP-Memory. We further design a Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online.
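As a rough sketch of this idea (not the released implementation; the class name, `num_ids`, and the momentum update below are illustrative assumptions), the CLIP-Memory can be viewed as one slot per identity that stands in for CLIP's text features and is refreshed online:

```python
# Minimal sketch, assuming a simple momentum update; names are illustrative,
# not the official TF-CLIP code.
import torch
import torch.nn.functional as F

class CLIPMemory(torch.nn.Module):
    def __init__(self, num_ids, dim, momentum=0.9):
        super().__init__()
        # one slot per identity; these sequence features stand in for CLIP's text features
        self.register_buffer("memory", F.normalize(torch.randn(num_ids, dim), dim=1))
        self.momentum = momentum

    @torch.no_grad()
    def update(self, seq_feats, pids):
        # online update of each identity's slot with its latest sequence feature
        new = self.momentum * self.memory[pids] + (1 - self.momentum) * F.normalize(seq_feats, dim=1)
        self.memory[pids] = F.normalize(new, dim=1)

    def forward(self, seq_feats):
        # image-to-memory similarity, analogous to CLIP's image-text logits
        return F.normalize(seq_feats, dim=1) @ self.memory.t()
```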
We propose a Temporal Memory Diffusion (TMD) module to capture temporal information. The frame-level memories in a sequence first communicate with each other to extract temporal cues, which are then diffused to each token and finally aggregated to obtain more robust temporal features.
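The communicate-diffuse-aggregate pattern above could be sketched with standard attention layers as follows (a hedged illustration; the module name, shapes, and use of plain multi-head attention are assumptions, not the paper's exact design):

```python
# Illustrative sketch of communicate -> diffuse -> aggregate;
# shapes follow nn.MultiheadAttention's (length, batch, dim) convention.
import torch
import torch.nn as nn

class TemporalMemoryDiffusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads)  # memories talk across time
        self.diffuse_attn = nn.MultiheadAttention(dim, num_heads)   # tokens attend to memories

    def forward(self, mem, tokens):
        # mem:    (T, B, D)   one frame-level memory token per frame
        # tokens: (T*N, B, D) patch tokens of all T frames, flattened over time
        mem, _ = self.temporal_attn(mem, mem, mem)       # 1) communicate across time
        tokens, _ = self.diffuse_attn(tokens, mem, mem)  # 2) diffuse to every token
        return tokens.mean(dim=0)                        # 3) aggregate -> (B, D)
```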
Performance
Pretrained Models
[x] MARS : Model&Code PASSWORD: 1234
[x] LS-VID : Model&Code PASSWORD: 1234
[x] iLIDS : Model&Code PASSWORD: 1234
t-SNE Visualization
Installation
conda create -n tfclip python=3.8
conda activate tfclip
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
pip install yacs
pip install timm
pip install scikit-image
pip install tqdm
pip install ftfy
pip install regex
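Optionally, a quick sanity check that PyTorch and CUDA were installed correctly:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"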
Download the datasets (MARS, LS-VID and iLIDS-VID), and then unzip them to your_dataset_dir.
For example, if you want to run the method on MARS, modify the bottom of configs/vit_base.yml to
DATASETS:
  NAMES: ('MARS')
  ROOT_DIR: ('your_dataset_dir')
OUTPUT_DIR: 'your_output_dir'
Then, run
CUDA_VISIBLE_DEVICES=0 python train-main.py
For example, if you want to test the method on MARS, run
CUDA_VISIBLE_DEVICES=0 python eval-main.py
This project is based on CLIP-ReID and XCLIP. Thanks to the authors for these excellent works.
If you have any questions, please feel free to send an email to yuchenyang@mail.dlut.edu.cn or asuradayuci@gmail.com. ^_^
If you find TF-CLIP useful, please consider citing :mega:
@inproceedings{tfclip,
  title={TF-CLIP: Learning Text-Free CLIP for Video-Based Person Re-identification},
  author={Yu, Chenyang and Liu, Xuehu and Wang, Yingquan and Zhang, Pingping and Lu, Huchuan},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6764--6772},
  year={2024}
}
TF-CLIP is released under the MIT License.