anosorae / IRRA

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval (CVPR 2023)

Confusion about the multimodal interaction encoder and mlm task #17

Closed qjyyyy closed 1 year ago

qjyyyy commented 1 year ago

In your paper, you mention that all parameters in the multimodal interaction encoder are randomly initialized. That module contains quite a lot of parameters, so I would like to ask whether you considered initializing it with the CLIP encoder's parameters, since this may affect the model's performance. Also, roughly what accuracy does the MLM task (mlm_acc) reach by the end of training? I couldn't run the full code because my GPU and PyTorch version are not supported.
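Concretely, what I have in mind is something like the sketch below, which copies CLIP text-transformer weights into interaction-encoder blocks of matching shape. The names `interaction_blocks` and `model.interaction_blocks` are placeholders I made up for illustration, not the actual module names in this repo:

```python
import clip  # OpenAI CLIP, https://github.com/openai/CLIP

# Hypothetical sketch, not this repo's actual code: copy weights from
# CLIP's text-transformer blocks into a randomly initialized interaction
# encoder whose blocks have matching parameter names and shapes
# (ViT-B/16's text tower: width 512, 8 heads, 12 layers).
clip_model, _ = clip.load("ViT-B/16", device="cpu")

def init_from_clip(interaction_blocks, clip_blocks):
    """Copy parameters layer by layer wherever name and shape match."""
    for tgt, src in zip(interaction_blocks, clip_blocks):
        tgt_state = tgt.state_dict()
        src_state = src.state_dict()
        matched = {k: v for k, v in src_state.items()
                   if k in tgt_state and v.shape == tgt_state[k].shape}
        tgt_state.update(matched)
        tgt.load_state_dict(tgt_state)

# e.g. init_from_clip(model.interaction_blocks, clip_model.transformer.resblocks)
```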

qjyyyy commented 1 year ago

Thank you again for sharing such a great idea and the code.

anosorae commented 1 year ago

We did not try initializing our multimodal interaction encoder with the CLIP encoder's parameters. The mlm_acc during training can be found in our released training logs on GitHub.
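For reference, mlm_acc is simply top-1 prediction accuracy over the masked token positions. A minimal sketch of how such a metric is typically computed (the `ignore_index` sentinel for unmasked positions is an assumption here; check the MLM loss code for the exact convention):

```python
import torch

def mlm_accuracy(scores: torch.Tensor, labels: torch.Tensor,
                 ignore_index: int = 0) -> float:
    """Top-1 accuracy over masked token positions only.

    scores: [batch, seq_len, vocab_size] MLM logits
    labels: [batch, seq_len]; unmasked positions carry `ignore_index`
            (an assumed convention, not necessarily this repo's).
    """
    preds = scores.argmax(dim=-1)
    mask = labels != ignore_index
    total = int(mask.sum())
    correct = int((preds[mask] == labels[mask]).sum())
    return correct / max(total, 1)
```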

qjyyyy commented 1 year ago

Thank you. :D