Closed qjyyyy closed 1 year ago
Thank you again for sharing such a great idea and code.
We did not try initializing our multimodal interaction encoder with the parameters of the CLIP encoder, and the mlm_acc during training can be found in our released training logs on GitHub.
Thank you. :D
In your paper, you mentioned that all parameters in the multimodal interaction encoder are randomly initialized. In fact, this part contains a lot of parameters, so I would like to ask whether you considered initializing it with the parameters of the CLIP encoder, as this may affect the model's performance. Also, what is the approximate final accuracy of the MLM task (mlm_acc)? I couldn't run the entire code because my graphics card and PyTorch version are not supported.
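For readers curious what "initializing from the CLIP encoder" would look like in practice, below is a minimal, dependency-free sketch of the usual state-dict surgery: parameters from a pretrained encoder are copied into a freshly (randomly) initialized module wherever a remapped key and shape match. All key names and prefixes here are hypothetical illustrations, not this repository's actual module names; a real implementation would operate on PyTorch `state_dict()` tensors instead of plain lists.

```python
# Hedged sketch: warm-starting a randomly initialized interaction encoder
# from pretrained CLIP weights via key remapping. The prefixes below are
# made-up examples, not the repo's real parameter names.

def init_from_clip(interaction_state, clip_state,
                   src_prefix="transformer.resblocks.",
                   dst_prefix="interaction.layers."):
    """Copy pretrained entries whose remapped key and shape match the target.

    interaction_state: dict of param-name -> flat list (stands in for tensors),
                       randomly initialized.
    clip_state:        dict of param-name -> flat list, pretrained CLIP weights.
    Returns the list of target keys that were overwritten.
    """
    copied = []
    for clip_key, values in clip_state.items():
        if not clip_key.startswith(src_prefix):
            continue  # skip parameters outside the encoder blocks
        target_key = dst_prefix + clip_key[len(src_prefix):]
        # Only copy when the target exists and the shapes agree;
        # everything else keeps its random initialization.
        if target_key in interaction_state and \
                len(interaction_state[target_key]) == len(values):
            interaction_state[target_key] = list(values)
            copied.append(target_key)
    return copied


if __name__ == "__main__":
    clip_state = {
        "transformer.resblocks.0.attn.w": [1.0, 2.0],
        "token_embedding.w": [3.0],  # no matching target; left alone
    }
    interaction_state = {
        "interaction.layers.0.attn.w": [0.0, 0.0],  # random init placeholder
        "interaction.layers.0.mlp.w": [0.0],        # stays randomly initialized
    }
    copied = init_from_clip(interaction_state, clip_state)
    print(copied)  # only the attention weight matched and was copied
```

Whether such warm-starting actually helps depends on how closely the interaction encoder's architecture mirrors CLIP's blocks; when the shapes diverge (e.g. cross-attention layers that CLIP lacks), those parameters must remain randomly initialized anyway.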