I'm sorry I didn't see your question until now.

For question 1: We directly used the pretrained version of BLIP.

For question 2: Your reasoning is likely correct. Given how we produce MALS, I think APTM can naturally learn something from Stable Diffusion and BLIP.

For question 3: We do have comparison results between APTM and BLIP, but only for a finetuned BLIP. In other words, we finetuned BLIP on CUHK-PEDES, and at least in our experiments its pedestrian-retrieval performance was not as good as APTM's.
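If you want to reproduce that baseline yourself, here is a rough sketch of ITM finetuning on CUHK-PEDES-style image-text pairs using the HuggingFace BLIP checkpoint. This is not our exact training code; the checkpoint name, negative sampling, and hyperparameters are placeholders you would need to tune:

```python
# Sketch of finetuning BLIP's image-text matching (ITM) head on
# person-retrieval pairs via HuggingFace transformers. Placeholder
# setup, not the exact recipe used in our experiments.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def itm_step(image_paths, captions, labels):
    """One ITM training step. labels: 1 = matched pair, 0 = mismatched pair."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    batch = processor(images=images, text=captions, return_tensors="pt", padding=True)
    out = model(**batch, use_itm_head=True)
    # out.itm_score has shape (batch, 2); index 1 is the "matched" class.
    loss = torch.nn.functional.cross_entropy(out.itm_score, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```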
Regarding the two approaches: if you have accurate person attributes, maybe you should try the second one first, since many studies have confirmed that pedestrian attributes help person re-identification. But the best way is to try both methods.
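Since your two approaches are not quoted above, here is only a hedged sketch of an attributes-as-text variant: fold the annotated attributes into the caption so that any text-based retrieval model (APTM or BLIP) can consume them. The attribute names below are made up for illustration:

```python
# Hypothetical attribute annotation folded into a retrieval caption.
def attributes_to_phrase(attrs: dict) -> str:
    """Turn an attribute dict into a natural-language phrase."""
    parts = []
    if "pose" in attrs:
        parts.append(f"in a {attrs['pose']} pose")
    if "carrying" in attrs:
        parts.append(f"carrying a {attrs['carrying']}")
    return ", ".join(parts)

caption = "A woman wearing a red jacket and blue jeans"
attrs = {"pose": "sitting", "carrying": "backpack"}  # hypothetical annotation
augmented = f"{caption}, {attributes_to_phrase(attrs)}."
print(augmented)
# A woman wearing a red jacket and blue jeans, in a sitting pose, carrying a backpack.
```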
These are just my opinions, and I can't guarantee they are correct. I hope they help.
Hi, I have tried APTM and it is awesome!
Now I want to finetune APTM for the text-based person retrieval task on custom data that contains more person attributes, such as different poses and additional objects. I'm trying to understand how important attribute learning is, and whether I could finetune BLIP instead of APTM.
In the paper, it is mentioned that "BLIP is used to produce more fitting captions for every synthetic image and form the final image-text pairs" in the MALS dataset. I have the following questions regarding this approach:
What approach would you suggest if I want APTM to understand more person attributes, like different poses and more objects?
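For context, here is how I am currently reproducing the captioning step with the off-the-shelf HuggingFace BLIP captioning checkpoint; the paper's exact setup may differ, and the image path is a placeholder:

```python
# Generate a caption for a synthetic person image with pretrained BLIP.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("synthetic_person.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```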
Thanks! Please correct me if I understood anything incorrectly.