SefaZeng opened 4 months ago
Evaluating gemma-2b with xcopa looks good, but the xnli result looks weird.
xcopa result:
"results": { "xcopa_zh": { "acc,none": 0.616, "acc_stderr,none": 0.021772369465547194, "alias": "xcopa_zh" }, "xcopa_vi": { "acc,none": 0.674, "acc_stderr,none": 0.02098400956239357, "alias": "xcopa_vi" }, "xcopa_tr": { "acc,none": 0.58, "acc_stderr,none": 0.02209471322976178, "alias": "xcopa_tr" }, "xcopa_th": { "acc,none": 0.57, "acc_stderr,none": 0.022162634426652835, "alias": "xcopa_th" }, "xcopa_it": { "acc,none": 0.618, "acc_stderr,none": 0.02175082059125084, "alias": "xcopa_it" }, "xcopa_id": { "acc,none": 0.646, "acc_stderr,none": 0.021407582047916447, "alias": "xcopa_id" } },
xnli result:
| Tasks   | Version | Filter | n-shot | Metric | Value  |   | Stderr |
|---------|--------:|--------|-------:|--------|-------:|---|-------:|
| xnli_zh |       1 | none   |      0 | acc    | 0.3261 | ± | 0.0094 |
| xnli_vi |       1 | none   |      0 | acc    | 0.3594 | ± | 0.0096 |
| xnli_tr |       1 | none   |      0 | acc    | 0.3458 | ± | 0.0095 |
| xnli_th |       1 | none   |      0 | acc    | 0.3317 | ± | 0.0094 |
| xnli_ru |       1 | none   |      0 | acc    | 0.3390 | ± | 0.0095 |
| xnli_hi |       1 | none   |      0 | acc    | 0.3382 | ± | 0.0095 |
| xnli_fr |       1 | none   |      0 | acc    | 0.3297 | ± | 0.0094 |
| xnli_es |       1 | none   |      0 | acc    | 0.3418 | ± | 0.0095 |
| xnli_en |       1 | none   |      0 | acc    | 0.3554 | ± | 0.0096 |
| xnli_de |       1 | none   |      0 | acc    | 0.3450 | ± | 0.0095 |
| xnli_ar |       1 | none   |      0 | acc    | 0.3390 | ± | 0.0095 |
A score around 0.33 looks more like a random guess, doesn't it? XNLI is a 3-way classification task (entailment / neutral / contradiction), so chance accuracy is 1/3.
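As a quick sanity check, here is a minimal sketch that compares each reported xnli accuracy against the 3-way chance level of 1/3. It assumes an eval-set size of n = 2490 (the XNLI validation split; this is consistent with the ~0.0095 stderr reported above, but it is an assumption, not something confirmed in the logs):

```python
import math

# Chance level for 3-way NLI (entailment / neutral / contradiction).
chance = 1 / 3
n = 2490  # ASSUMED eval-set size; back-solved from the reported stderr of ~0.0095

# Accuracies copied from the xnli table above.
accs = {
    "xnli_zh": 0.3261, "xnli_vi": 0.3594, "xnli_tr": 0.3458,
    "xnli_th": 0.3317, "xnli_ru": 0.3390, "xnli_hi": 0.3382,
    "xnli_fr": 0.3297, "xnli_es": 0.3418, "xnli_en": 0.3554,
    "xnli_de": 0.3450, "xnli_ar": 0.3390,
}

# Standard error of the accuracy under pure random guessing.
se_chance = math.sqrt(chance * (1 - chance) / n)

for task, acc in accs.items():
    z = (acc - chance) / se_chance  # how many SEs above chance
    print(f"{task}: acc={acc:.4f}, z={z:+.2f}")
```

Under this assumption, most languages sit within about two standard errors of 1/3, i.e. statistically indistinguishable from guessing; only a couple (e.g. vi, en) are marginally above chance.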
Maybe @lintangsutawika has some ideas on this.
On the whole, (small?) LMs are pretty bad at the NLI task, I think. Maybe it's a matter of prompting?