This is the official PyTorch implementation of the paper [MARS: Paying more attention to visual attributes for text-based person search](https://arxiv.org/abs/2407.04287). This repository supports training and evaluation on three text-based person search benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.
For dataset preparation and download, please refer to RaSa.
Inside the `shell` folder there is a script for each training run. To train our model, choose a dataset and run:
```bash
# 1. Training on CUHK-PEDES
bash shell/cuhk_train.sh

# 2. Training on ICFG-PEDES
bash shell/icfg_train.sh

# 3. Training on RSTPReid
bash shell/rstp_train.sh
```
Before training, please update the dataset location inside each `.yaml` file.
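As a quick sanity check, the sketch below loads a config and verifies that the dataset path it points to exists. The config filename and the `image_root` key are assumptions for illustration; substitute the actual file and field names used in this repository's `.yaml` configs.

```python
# Minimal sketch: check that the dataset path set in the config exists.
# "configs/cuhk_pedes.yaml" and the "image_root" key are hypothetical names.
import os
import yaml

with open("configs/cuhk_pedes.yaml") as f:   # hypothetical config path
    cfg = yaml.safe_load(f)

image_root = cfg.get("image_root", "")
print(f"image_root = {image_root!r}, exists = {os.path.isdir(image_root)}")
```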
Inside the `shell` folder there is also a script to test the model on each dataset.
```bash
# 1. Testing on CUHK-PEDES
bash shell/cuhk-eval.sh

# 2. Testing on ICFG-PEDES
bash shell/icfg-eval.sh

# 3. Testing on RSTPReid
bash shell/rstp-eval.sh
```
Before testing, please update the checkpoint location inside each `.sh` file.
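Similarly, a minimal sketch for verifying that the checkpoint path you set can actually be loaded; the path below is a placeholder, not the repository's actual output layout.

```python
# Minimal sketch: verify the checkpoint referenced in the eval script loads.
# The path is a placeholder -- replace it with the value set in the .sh file.
import torch

ckpt_path = "output/cuhk_pedes/checkpoint_best.pth"  # hypothetical path
checkpoint = torch.load(ckpt_path, map_location="cpu")
print("Loaded checkpoint with keys:", list(checkpoint.keys())[:5])
```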
Results on CUHK-PEDES:

Method | Rank-1 | Rank-5 | Rank-10 | mAP |
---|---|---|---|---|
CMPM/C | 49.37 | 71.69 | 79.27 | - |
ViTAA | 55.97 | 75.84 | 83.52 | - |
DSSL | 59.98 | 80.41 | 87.56 | - |
SAF | 64.13 | 82.62 | 88.40 | 58.61 |
LGUR | 65.25 | 83.12 | 89.00 | - |
IVT | 65.59 | 83.11 | 89.21 | - |
CFine | 69.57 | 85.93 | 91.15 | - |
ALBEF | 60.28 | 79.52 | 86.34 | 56.67 |
RaSa | 76.51 | 90.29 | 94.25 | 69.38 |
MARS (ours) | 77.62 | 90.63 | 94.27 | 71.41 |
Results on ICFG-PEDES:

Method | Rank-1 | Rank-5 | Rank-10 | mAP |
---|---|---|---|---|
CMPM/C | 43.51 | 65.44 | 74.26 | - |
SSAN | 54.23 | 72.63 | 79.53 | - |
SAF | 54.86 | 72.13 | 79.13 | 32.76 |
IVT | 56.04 | 73.60 | 80.22 | - |
CFine | 60.83 | 76.55 | 82.42 | - |
ALBEF | 34.46 | 52.32 | 60.40 | 19.62 |
RaSa | 65.28 | 80.40 | 85.12 | 41.29 |
MARS (ours) | 67.60 | 81.47 | 85.79 | 44.93 |
Results on RSTPReid:

Method | Rank-1 | Rank-5 | Rank-10 | mAP |
---|---|---|---|---|
DSSL | 32.43 | 55.08 | 63.19 | - |
SSAN | 43.50 | 67.80 | 77.15 | - |
SAF | 44.05 | 67.30 | 76.25 | 36.81 |
IVT | 46.70 | 70.00 | 78.80 | - |
CFine | 50.55 | 72.50 | 81.60 | - |
ALBEF | 50.10 | 73.70 | 82.10 | 41.73 |
RaSa | 66.90 | 86.50 | 91.35 | 52.31 |
MARS (ours) | 67.55 | 86.55 | 91.35 | 52.92 |
Visual comparison of the cross-attention maps generated by the baseline model (top) and our model (bottom) using Grad-CAM. The attention maps illustrate the cross-modal encoder's focus on the image regions corresponding to individual words in the attribute chunks. The proposed attribute loss leads to a more consistent and accurate attention distribution across words.
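As a rough illustration of the idea behind these maps (not the repository's actual visualization code), the sketch below computes a Grad-CAM-style heatmap for one word by weighting a toy cross-attention map with the gradient of a stand-in matching score; all tensor shapes and the scoring head are placeholders.

```python
# Toy Grad-CAM over a cross-attention map: weight the attention by its gradient
# w.r.t. a matching score and keep the positive evidence. Shapes and the score
# below are placeholders, not the actual MARS model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
text_tokens = torch.randn(1, 8, 64, requires_grad=True)   # [batch, words, dim]
image_patches = torch.randn(1, 49, 64)                     # [batch, patches, dim] (7x7 grid)

# Single-head cross-attention: each word attends to all image patches.
attn = (text_tokens @ image_patches.transpose(1, 2) / 64 ** 0.5).softmax(dim=-1)
attn.retain_grad()                                          # keep the gradient of this non-leaf tensor

score = (attn @ image_patches).mean()                       # stand-in for the matching score
score.backward()

cam = F.relu(attn * attn.grad)                              # [1, words, patches]
heatmap = cam[0, 0].reshape(7, 7)                           # heatmap for the first word
print(heatmap)
```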
Comparison of the top-10 predictions of the baseline and our model. Predicted images are ranked from left (position 1) to right (position 10). Our model outperforms the baseline in several pairs, i.e., a, b, c, and d. In pair c, every prediction of our model contains a bike, which is not the case for the baseline. Furthermore, even though in pair e our model does not predict the second position correctly, it achieves a higher mAP by providing 3 correct matches in the top 10 positions compared to the baseline's 2. Lastly, in pair f our model is not able to retrieve any correct image due to the vagueness of the caption, but it still returns images closely related to the text.
The implementation of MARS relies on resources from RaSa, Hugging Face Transformers, and timm. We sincerely thank the original authors for open-sourcing their work.
If you find this code useful for your research, please cite our paper:
```bibtex
@misc{ergasti2024marspayingattentionvisual,
      title={MARS: Paying more attention to visual attributes for text-based person search},
      author={Alex Ergasti and Tomaso Fontanini and Claudio Ferrari and Massimo Bertozzi and Andrea Prati},
      year={2024},
      eprint={2407.04287},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.04287},
}
```