EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

XNLI weird result with gemma-2b #1583

Open SefaZeng opened 4 months ago

SefaZeng commented 4 months ago

Evaluating gemma-2b on xcopa looks fine, but the xnli results look weird.

xcopa result:

```json
  "results": {
    "xcopa_zh": {
      "acc,none": 0.616,
      "acc_stderr,none": 0.021772369465547194,
      "alias": "xcopa_zh"
    },
    "xcopa_vi": {
      "acc,none": 0.674,
      "acc_stderr,none": 0.02098400956239357,
      "alias": "xcopa_vi"
    },
    "xcopa_tr": {
      "acc,none": 0.58,
      "acc_stderr,none": 0.02209471322976178,
      "alias": "xcopa_tr"
    },
    "xcopa_th": {
      "acc,none": 0.57,
      "acc_stderr,none": 0.022162634426652835,
      "alias": "xcopa_th"
    },
    "xcopa_it": {
      "acc,none": 0.618,
      "acc_stderr,none": 0.02175082059125084,
      "alias": "xcopa_it"
    },
    "xcopa_id": {
      "acc,none": 0.646,
      "acc_stderr,none": 0.021407582047916447,
      "alias": "xcopa_id"
    }
  },
```

xnli result:

| Tasks |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------|------:|------|-----:|------|-----:|---|-----:|
|xnli_zh|      1|none  |     0|acc   |0.3261|±  |0.0094|
|xnli_vi|      1|none  |     0|acc   |0.3594|±  |0.0096|
|xnli_tr|      1|none  |     0|acc   |0.3458|±  |0.0095|
|xnli_th|      1|none  |     0|acc   |0.3317|±  |0.0094|
|xnli_ru|      1|none  |     0|acc   |0.3390|±  |0.0095|
|xnli_hi|      1|none  |     0|acc   |0.3382|±  |0.0095|
|xnli_fr|      1|none  |     0|acc   |0.3297|±  |0.0094|
|xnli_es|      1|none  |     0|acc   |0.3418|±  |0.0095|
|xnli_en|      1|none  |     0|acc   |0.3554|±  |0.0096|
|xnli_de|      1|none  |     0|acc   |0.3450|±  |0.0095|
|xnli_ar|      1|none  |     0|acc   |0.3390|±  |0.0095|

A score of around 0.33 looks more like random guessing, since XNLI is a three-way classification task?
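For reference, XNLI has three labels (entailment, neutral, contradiction), so uniform random guessing yields 1/3 ≈ 0.333 accuracy. A quick sketch (accuracies copied from the table above; the three-standard-error threshold is my own choice, not anything the harness reports) shows every score is statistically close to that baseline:

```python
# XNLI is 3-way classification, so a uniform random guesser scores ~1/3.
random_baseline = 1 / 3

# Zero-shot accuracies from the table above.
xnli_acc = {
    "zh": 0.3261, "vi": 0.3594, "tr": 0.3458, "th": 0.3317,
    "ru": 0.3390, "hi": 0.3382, "fr": 0.3297, "es": 0.3418,
    "en": 0.3554, "de": 0.3450, "ar": 0.3390,
}

# The reported stderr is ~0.0095 for each language; compute how many
# standard errors each score sits from the chance baseline.
stderr = 0.0096
for lang, acc in sorted(xnli_acc.items()):
    z = abs(acc - random_baseline) / stderr
    print(f"{lang}: acc={acc:.4f}, |z|={z:.2f}")
```

Every |z| comes out under 3, i.e. the scores are hard to distinguish from chance.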

haileyschoelkopf commented 4 months ago

Maybe @lintangsutawika has some ideas on this.

On the whole, (small?) LMs are pretty bad at the NLI task, I think. Maybe it's a matter of prompting?
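If prompting is the suspect, one way to test that is a variant task config. Below is a hypothetical sketch in the harness's YAML task format, swapping in an ANLI-style "True / Neither / False" prompt; the task name and prompt wording are illustrative, not the shipped xnli config:

```yaml
# Hypothetical prompt-variant task for XNLI (English); field names follow
# the harness's YAML task schema, but the values here are illustrative.
task: xnli_en_prompt_variant
dataset_path: xnli
dataset_name: en
output_type: multiple_choice
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
doc_to_choice: ["True", "Neither", "False"]
doc_to_target: label
metric_list:
  - metric: acc
```

Comparing this against the default template (and at a few n-shot settings) would show whether the chance-level scores are a prompting artifact or a genuine model limitation.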