CONE-MT / MindMerger


Eval MGSM #5


facemyself commented 1 month ago

Thank you for this wonderful work. I would like to ask whether you set top_p and temperature during evaluation. Under what settings did you get the MGSM results in your paper, and did you average over multiple evaluations? I ran your code and found small discrepancies in the results: the average score differed by 0.3 points, and some individual languages were about 2% lower.

HuangZixian commented 1 month ago

Hi, glad to see you reproduced our model!

We use greedy search to decode text, without setting top_p and temperature.
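For concreteness, a minimal sketch of what greedy decoding looks like with the Hugging Face `generate` API (the model name and prompt below are placeholders, not our exact evaluation code):

```python
# Minimal greedy-decoding sketch with Hugging Face transformers.
# The checkpoint name and prompt are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Janet has 3 apples and buys 2 more. How many apples does she have?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False selects greedy search; top_p and temperature are not used
# in this mode, so the output is deterministic and reproducible across runs.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```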

I think your reproduced results are acceptable. Since the MGSM test set is relatively small, with only 250 samples per language, each question is worth 100 / 250 = 0.4 points, so it is normal for the score on a single language to fluctuate slightly under different hyperparameters, while the average score across all languages is relatively stable.

We also observed this phenomenon when reproducing other baselines, so for the MGSM dataset we pay more attention to whether the average score is stable; a rough illustration follows below.
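To make the arithmetic concrete (with purely made-up accuracies, not reproduced results): a 2-point gap on one language is only 5 questions out of 250, while the macro average over several languages barely moves:

```python
# Purely hypothetical per-language MGSM accuracies (%) for two runs;
# the numbers are illustrative, not actual results.
run_a = {"en": 62.4, "de": 55.2, "fr": 54.0, "zh": 48.8, "ja": 43.2}
run_b = {"en": 62.0, "de": 53.2, "fr": 54.8, "zh": 49.2, "ja": 43.6}

# Each MGSM language has 250 test questions, so one question = 0.4 points.
per_sample = 100 / 250

avg_a = sum(run_a.values()) / len(run_a)
avg_b = sum(run_b.values()) / len(run_b)

# A single-language gap can be several "questions" wide while the
# average across languages shifts by a fraction of a point.
print(f"de gap  = {abs(run_a['de'] - run_b['de']) / per_sample:.0f} questions")
print(f"avg gap = {abs(avg_a - avg_b):.2f} points")
```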

facemyself commented 1 month ago

Hello, could you please share your evaluation code for the MSVAMP, X-CSQA, and XNLI datasets? My email address is wen1591591@gmail.com or 1434800673@qq.com. Thank you very much.