wedu-nvidia opened 1 month ago
Thank you very much for your attention and thorough exploration of our work, and we apologize for the delayed response; we have been busy writing the technical report.
Regarding your question, you are correct in your observations. We later identified issues in the inference code for several models related to the `max_new_tokens` setting. After increasing `max_new_tokens` to 2048, we evaluated the models again. As requested, we have made all outputs and GPT evaluation files for Qwen2.5-MATH-72b-Instruct and Llama-3.1-70b-Instruct available for your reference. We have also released our technical report; for further details, please refer to https://arxiv.org/abs/2410.07985
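For context on why this setting matters: in Hugging Face `transformers`, `max_new_tokens` caps how many tokens `model.generate` may emit, so a cap smaller than a long chain-of-thought answer silently truncates the output and the grader scores an incomplete solution. Here is a minimal, framework-free sketch of that effect; the old cap of 512 is a hypothetical value for illustration, not the repo's actual setting.

```python
def cap_generation(tokens, max_new_tokens):
    """Simulate the max_new_tokens cap: keep at most max_new_tokens tokens."""
    return tokens[:max_new_tokens]

# A hypothetical 1500-token chain-of-thought solution.
long_solution = list(range(1500))

# With a small (hypothetical) cap, the answer is cut off mid-solution.
print(len(cap_generation(long_solution, 512)))   # 512 — truncated

# With the corrected cap of 2048, the full answer survives.
print(len(cap_generation(long_solution, 2048)))  # 1500 — complete
```

Any model whose typical solution length exceeds the cap will look artificially weak under evaluation, which matches the discrepancies observed here.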
Thank you once again for your interest and valuable feedback.
Hello, I am currently working on a benchmark using this dataset and have noticed discrepancies in the results for Llama-3.1-70b. The repository contains only 100 examples, and even for those, the model's outputs ('model-generation') appear to be incomplete. Could I kindly request the full set of 4,428 model generations by Llama-3.1-70b, including the complete outputs, so I can conduct a thorough comparison?
Thank you so much!