Eval MGSM - GitHub Issues

facemyself commented 1 month ago

Thank you for this wonderful work. I would like to ask whether you set top_p and temperature during the evaluation. Under what settings did you get the MGSM results in your paper? Did you average over multiple evaluation runs? I ran your code and found that my results differ slightly from the reported ones: the average differed by 0.3, and some languages were 2% lower.

HuangZixian commented 1 month ago

Hi, glad to see you reproduced our model!

We use greedy search to decode text, without setting top_p and temperature.
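For readers unfamiliar with the term, here is a minimal sketch of what greedy search means. It is a toy illustration, not the repository's evaluation code: `greedy_decode`, `toy_scores`, and `TOY_TABLE` are hypothetical names invented for this example. At every step the single highest-scoring token is chosen, so nothing is sampled and top_p and temperature have no role.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 16,
) -> List[str]:
    """Append the highest-scoring token at each step until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax: no random sampling
        if best == eos:
            break
        tokens.append(best)
    return tokens


# Hypothetical toy "model": the scores depend only on the last token.
TOY_TABLE = {
    "2+3=": {"5": 0.9, "6": 0.1},
    "5": {"<eos>": 0.8, "5": 0.2},
    "6": {"<eos>": 1.0},
}


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    return TOY_TABLE[tokens[-1]]


output = greedy_decode(toy_scores, ["2+3="])
print(output)
```

Because the choice at each step is an argmax, repeated runs on the same inputs give the same tokens. In Hugging Face `transformers`, the corresponding setting is usually `generate(..., do_sample=False)` with a single beam; whether the evaluation script uses exactly that call is not stated in this thread.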

I think your reproduced results are acceptable. Since the test set of MGSM is relatively small with only 250 samples per language, it is normal for the scores on a single language to fluctuate slightly under different hyperparameters, but the average score of all languages is relatively stable.

We also observed this phenomenon when reproducing other baselines, so for the MGSM dataset it is more informative to check whether the average score is stable.
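To put the 250-sample point in numbers, the following rough sketch shows how coarse a single-language score is and why an average over languages moves less. It rests on several assumptions that are not from the thread: the "2%" gap is read as 2 accuracy points, the accuracy of 0.5 and the count of 10 languages are placeholder values, and per-language deviations are treated as independent. The function names are hypothetical.

```python
import math

N_PER_LANGUAGE = 250  # test-set size per language, as stated above


def points_per_question(n: int) -> float:
    """Accuracy points gained or lost when one answer flips."""
    return 100.0 / n


def binomial_std_error_points(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in accuracy points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


step = points_per_question(N_PER_LANGUAGE)

# Assumption: the reported "2% lower" means 2 accuracy points.
questions_for_gap = round(2.0 / step)

# Assumption: accuracy of 0.5 is a placeholder, not a reported result.
se_single = binomial_std_error_points(0.5, N_PER_LANGUAGE)

# Assumption: 10 languages with independent deviations of similar size.
NUM_LANGUAGES = 10
se_average = se_single / math.sqrt(NUM_LANGUAGES)

print(f"one question = {step:.1f} points")
print(f"a 2-point gap = {questions_for_gap} questions")
print(f"std error, single language: {se_single:.2f} points")
print(f"std error, average over {NUM_LANGUAGES} languages: {se_average:.2f} points")
```

Under these assumptions, a 2-point gap on one language corresponds to only a handful of questions, while the average over languages varies by roughly a third as much. This is consistent with the advice to watch the average rather than individual languages.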

facemyself commented 1 month ago

Thank you for this wonderful work. Hello, could you please give me a copy of your evaluation code for MSVAMP, X-CSQA, and XNLI datasets? My email address is wen1591591@gmail.com or 1434800673@qq.com. Thank you very much.

CONE-MT / MindMerger

Eval MGSM #5