How to evaluate model outputs on the test set

CMMMU-Benchmark commented 6 months ago

Hi,

Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation! Thank you for your attention.

Best wishes!

AdonLee072348 commented 5 months ago

Thanks for your reply! We have evaluated our model, Marco-VL-Plus, on EvalAI; its accuracy on the val and test sets is 43.44% and 40.69%, respectively. Would you please consider showing our results in your repo?

val result:

| Subject | Correct Num | Entries Num | Acc |
|--------------------------------|-------------|-------------|----------|
| art_and_design | 58 | 88 | 0.659091 |
| business | 31 | 126 | 0.246032 |
| science | 71 | 204 | 0.348039 |
| health_and_medicine | 83 | 153 | 0.542484 |
| humanities_and_social_sciences | 46 | 85 | 0.541176 |
| technology_and_engineering | 102 | 244 | 0.418033 |
| all | 391 | 900 | 0.434444 |

test result (test_split, as returned by EvalAI):

| Subject | Acc (%) |
|------------------------------|-------------------|
| Art & Design | 66.72777268560954 |
| Business | 21.911573472041614 |
| Science | 36.80834001603849 |
| Health & Medicine | 46.27345844504022 |
| Humanities & Social Sciences | 47.10982658959538 |
| Technology & Engineering | 38.36583725622058 |
| Overall | 40.69090909090909 |
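As a sanity check on the val numbers, here is a minimal sketch (not the official CMMMU evaluation script) that recomputes the per-subject and overall accuracy from the Correct Num and Entries Num columns. The names `val_counts`, `subject_accuracy`, and `overall_accuracy` are hypothetical. It assumes the "all" row is the pooled accuracy over all 900 entries rather than the mean of the per-subject accuracies, which is consistent with 391 / 900.

```python
# Counts copied from the val result table: subject -> (correct, entries).
val_counts = {
    "art_and_design": (58, 88),
    "business": (31, 126),
    "science": (71, 204),
    "health_and_medicine": (83, 153),
    "humanities_and_social_sciences": (46, 85),
    "technology_and_engineering": (102, 244),
}


def subject_accuracy(counts):
    """Return accuracy per subject as correct / entries."""
    return {subject: correct / entries for subject, (correct, entries) in counts.items()}


def overall_accuracy(counts):
    """Return pooled accuracy: total correct over total entries (assumed meaning of 'all')."""
    total_correct = sum(correct for correct, _ in counts.values())
    total_entries = sum(entries for _, entries in counts.values())
    return total_correct / total_entries


per_subject = subject_accuracy(val_counts)
overall = overall_accuracy(val_counts)

for subject, acc in per_subject.items():
    print(f"{subject}: {acc:.6f}")
print(f"all: {overall:.6f}")
```

Because the subjects have different numbers of entries, the pooled figure differs from a simple average of the six per-subject accuracies, so the two should not be compared directly.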

XinrunDu commented 4 months ago

Thank you for your reply, and for your interest in the CMMMU benchmark.

Regarding your inquiry about submitting your model to the leaderboard, there are a few details we need to confirm with you:

Do you plan to open-source your model, and should it be listed under open-source or private models?
Can you confirm that the model name to display on the leaderboard is Marco-VL-Plus?

We appreciate your support and contribution to our work once again. Should you have any further questions or require additional assistance, please feel free to contact us at any time.

Best, CMMMU Team

AdonLee072348 commented 4 months ago

Thank you for your great work on the CMMMU benchmark. Our model is currently private, but we plan to release it in the future. And yes, the model name is Marco-VL-Plus.

shan23chen commented 1 month ago

Great project!

I would love to know whether you could provide the test answer key for a subset of the health and science partitions. I also hope we can chat and see whether we can collaborate!

Thanks! Shan Chen, PhD candidate @ Harvard AIM
