Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Model performance and evaluation metrics in the OmniMedVQA dataset #21
Thanks for your work!
After reading the OmniMedVQA paper, I have two questions and look forward to your answers.
According to the MedVInT and RadFM papers, RadFM is trained on a larger dataset than MedVInT (16M vs. 1.64M samples). However, in your paper MedVInT outperforms RadFM. Have you analyzed the prediction results of the two models further?
The QA score and the prefix-based score are distributed differently across image modalities. Which metric is more reliable when selecting a model for a given modality?
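For reference, here is a minimal sketch of how the two metrics are commonly computed for multiple-choice medical VQA: a QA-style score checks whether the free-form generation matches the ground-truth option, while a prefix-based score ranks the candidate options by their likelihood under the model. The model choice (`gpt2`), function names, and prompt format below are assumptions for illustration, not the repository's actual evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical stand-in model; the evaluated vision-language models differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def qa_score(generated_answer: str, gt_answer: str) -> float:
    """QA-style accuracy: the free-form generation must contain the ground-truth option."""
    return float(gt_answer.strip().lower() in generated_answer.strip().lower())


def prefix_score(question: str, options: list[str], gt_answer: str) -> float:
    """Prefix-based accuracy: pick the option whose tokens are most likely
    given the question as a prefix, then compare it with the ground truth."""
    losses = []
    for opt in options:
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + opt, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # score only the option tokens
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss.item()
        losses.append(loss)
    predicted = options[losses.index(min(losses))]
    return float(predicted == gt_answer)
```

Under this reading, the prefix-based score is insensitive to how verbose or well-formatted the generation is, whereas the QA-style score penalizes answers that are correct but phrased differently, which may explain why the two metrics diverge across modalities.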