Hello,

I attempted to replicate the evaluation results reported in the paper on two datasets, Slake1.0 and PathVQA, using the released data linked here: https://github.com/UCSC-VLAA/MedTrinity-25M/issues/6#issuecomment-2345623211. However, my results do not match those reported in the paper. Details are below:
Slake1.0 Dataset: The provided checkpoint appears to have been fine-tuned without pretraining on MedTrinity-25M, since my results are very close to those of LLaVA-Med++ (Ours, w/o) in Table 3 of the paper.
PathVQA Dataset: For the Closed set, I was able to replicate the reported accuracy. However, on the Open set, my recall is significantly lower than the published result (the recall computation I used is sketched below).
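For reference, this is a minimal sketch of the token-level recall I computed for the Open set, in case my metric differs from the one behind the reported numbers. The normalization and tokenization here are my own assumptions, not necessarily what the repo's evaluation script does:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into word tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def open_set_recall(prediction: str, ground_truth: str) -> float:
    """Fraction of ground-truth tokens that also appear in the prediction."""
    gt_tokens = normalize(ground_truth)
    pred_tokens = set(normalize(prediction))
    if not gt_tokens:
        return 0.0
    return sum(tok in pred_tokens for tok in gt_tokens) / len(gt_tokens)

# Averaged over all Open-set PathVQA samples:
# scores = [open_set_recall(pred, ans) for pred, ans in zip(predictions, answers)]
# print(sum(scores) / len(scores))
```

If the paper used a different normalization or metric for the Open set, that could explain part of the gap.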
To help diagnose these issues, I have attached two images, one showing the evaluation run on each of the two datasets above.
Could you kindly verify whether the provided fine-tuning checkpoint for Slake1.0 is correct? Additionally, it would be helpful to understand any specific steps necessary to replicate the reported recall values for the PathVQA Open set.
Thank you in advance for your assistance!