Shuyu-XJTU / APTM

The official code of "Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark"
https://arxiv.org/abs/2306.02898
MIT License

Comparison with BLIP and importance of Attribute Learning #9

Closed scholarstree closed 10 months ago

scholarstree commented 1 year ago

Hi, I have tried APTM and it is awesome!

Now, I want to finetune APTM for the text-based person retrieval task on custom data that contains more person attributes, such as different poses and more objects. I'm trying to understand how important attribute learning is, and whether I could finetune BLIP instead of APTM.

In the paper, it is mentioned that "BLIP is used to produce more fitting captions for every synthetic image and form the final image-text pairs" for the MALS dataset. I have the following questions regarding this approach (a rough sketch of what I assume the captioning step looks like follows the questions):

  1. Did you use the pretrained version of BLIP, or did you finetune it on some person image-text pairs before labelling the MALS dataset?
  2. If you are using the BLIP-generated captions as ground truth and then pretraining APTM on them, doesn't that mean APTM is essentially trying to match BLIP's performance during the pretraining phase? Perhaps in the finetuning phase APTM may then surpass BLIP. Is this thought process correct?
  3. Do you have comparison results between APTM and BLIP (either pretrained, or finetuned on some person image-text pairs)?
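
For context, here is roughly what I assume the captioning step looks like, using an off-the-shelf pretrained BLIP captioning checkpoint from HuggingFace; the checkpoint name, framework, and file name are my assumptions, not necessarily what you used:

```python
# Rough sketch of captioning a synthetic person image with pretrained BLIP.
# The checkpoint is a common public one, assumed here only for illustration.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("synthetic_person.jpg").convert("RGB")   # hypothetical file
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # this caption would become the text half of the image-text pair
```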

What approach would you suggest if I want APTM to understand more person attributes, such as different poses and more objects?

  1. Finetune with ITC + ITM + MLM. This would require only image-text pairs (and could be done with BLIP too).
  2. Finetune with IAC + IAM + MAM + ITC + ITM + MLM. This would require attribute labels to be prepared. (A rough sketch of how these two loss combinations might be wired up follows this list.)
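
For reference, here is a minimal sketch of how the two loss combinations above might be combined in a single training step; all model methods, batch keys, and the equal loss weights are hypothetical placeholders rather than the actual APTM API:

```python
def training_step(model, batch, use_attributes=False):
    # Approach 1: image-text objectives only (needs just image-text pairs).
    loss_itc = model.image_text_contrastive(batch["image"], batch["text"])
    loss_itm = model.image_text_matching(batch["image"], batch["text"])
    loss_mlm = model.masked_language_modeling(batch["image"], batch["text"])
    loss = loss_itc + loss_itm + loss_mlm

    # Approach 2: additionally supervise with attribute labels (IAC + IAM + MAM).
    if use_attributes:
        loss_iac = model.image_attribute_contrastive(batch["image"], batch["attributes"])
        loss_iam = model.image_attribute_matching(batch["image"], batch["attributes"])
        loss_mam = model.masked_attribute_modeling(batch["image"], batch["attributes"])
        loss = loss + loss_iac + loss_iam + loss_mam

    return loss
```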

Thanks! Please correct me if I understood anything incorrectly.

Shuyu-XJTU commented 1 year ago

I'm sorry I didn't see your question until now.

For question 1: we directly used the pretrained version of BLIP.

For question 2: the thought process might be correct. I think that, given the way we produce MALS, APTM can naturally learn something from Stable Diffusion and BLIP.

For question 3: we do have comparison results between APTM and BLIP, but only for a finetuned BLIP. In other words, we finetuned BLIP on CUHK-PEDES, and at least in our experiments its pedestrian retrieval performance is not as good as APTM's.
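
As a reference for how such a comparison is usually scored, here is a minimal sketch of computing text-to-image retrieval Rank-K on CUHK-PEDES-style data; the function and tensor names are hypothetical, and this is not the evaluation script of this repo:

```python
import torch

@torch.no_grad()
def rank_at_k(text_feats, image_feats, query_pids, gallery_pids, ks=(1, 5, 10)):
    # Cosine similarity between every query caption and every gallery image.
    text_feats = torch.nn.functional.normalize(text_feats, dim=-1)
    image_feats = torch.nn.functional.normalize(image_feats, dim=-1)
    sims = text_feats @ image_feats.t()            # [num_texts, num_images]

    order = sims.argsort(dim=-1, descending=True)  # gallery indices sorted by similarity
    results = {}
    for k in ks:
        topk_pids = gallery_pids[order[:, :k]]     # person IDs of the top-k retrieved images
        hit = (topk_pids == query_pids.unsqueeze(1)).any(dim=1)
        results[f"Rank-{k}"] = hit.float().mean().item()
    return results
```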

For the two approaches: if you have correct person attributes, you should probably try the second one, because many studies on pedestrian attributes have confirmed that attribute information can help person re-identification. But the best way is to try both methods.
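
If you do prepare attribute labels, one simple way to organize them is as per-image multi-hot vectors; the attribute vocabulary below is made up purely for illustration and is not the MALS attribute set:

```python
import torch

# Hypothetical attribute vocabulary covering the extra attributes you mention.
ATTRIBUTES = ["male", "long hair", "backpack", "riding bicycle", "sitting", "holding phone"]
ATTR_INDEX = {name: i for i, name in enumerate(ATTRIBUTES)}

def to_multi_hot(attribute_names):
    # Convert a list of attribute strings into a fixed-size 0/1 label vector.
    vec = torch.zeros(len(ATTRIBUTES))
    for name in attribute_names:
        vec[ATTR_INDEX[name]] = 1.0
    return vec

# e.g. an annotation {"image": "0001.jpg", "attributes": ["male", "backpack", "sitting"]}
label = to_multi_hot(["male", "backpack", "sitting"])
```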

These are just my opinions, and I am not sure whether they are correct. I hope they help you.