KbsdJames / Omni-MATH

The official repository of the Omni-MATH benchmark.

Incomplete model generations in example file #1

Open wedu-nvidia opened 1 month ago

wedu-nvidia commented 1 month ago

Hello, I am currently working on a benchmark using this dataset and have noticed discrepancies in the results for Llama-3.1-70b. The repository only contains 100 examples, and the model's outputs ('model-generation') appear to be incomplete even for those 100 examples. Could I kindly request the full set of 4,428 model generations by Llama-3.1-70b, including the complete outputs, so I can conduct a thorough comparison?

Thank you so much!

KbsdJames commented 1 week ago

Thank you very much for your attention and thorough exploration of our work. We apologize for the delayed response; we have been busy writing the technical report.

Your observations are correct. We later identified issues in the inference code for several models regarding the setting of max_new_tokens: the limit was too small, so generations were cut off before completion. After increasing max_new_tokens to 2048, we re-evaluated the models. In response to your request, we have made all outputs and GPT evaluation files for Qwen2.5-MATH-72b-Instruct and Llama-3.1-70b-Instruct available for your reference. We have also released our technical report; for any further questions, please refer to https://arxiv.org/abs/2410.07985
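For anyone running into the same issue, truncated generations like those described above can often be flagged heuristically before re-running inference. Below is a minimal sketch (not part of the Omni-MATH codebase): it assumes the model reports how many tokens it generated and that a complete answer ends with a `\boxed{...}` marker, which is a common convention for math benchmarks but should be adjusted to your own setup.

```python
# Heuristic check for truncated model generations: an output cut off by a
# too-small max_new_tokens limit typically exhausts the full token budget
# and never reaches a final boxed answer.
# NOTE: the `\boxed{` marker and the token-count argument are assumptions
# about the evaluation setup, not part of the Omni-MATH repository.

MAX_NEW_TOKENS = 2048  # the limit mentioned in this thread


def looks_truncated(generation: str, num_generated_tokens: int,
                    limit: int = MAX_NEW_TOKENS) -> bool:
    """Flag a generation as likely truncated if it used the entire
    token budget or never produced a final boxed answer."""
    hit_token_limit = num_generated_tokens >= limit
    has_final_answer = "\\boxed{" in generation
    return hit_token_limit or not has_final_answer


# Example: a complete answer vs. one that stops mid-derivation.
complete = "Therefore the answer is \\boxed{42}."
cut_off = "Expanding the product, we obtain 4x^2 +"
```

Generations flagged this way can then be re-run with a larger max_new_tokens rather than re-running the whole benchmark.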

Thank you once again for your interest and valuable feedback.