OpenGPTX / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

Evaluation for x_stance dataset #10

Closed · karina-hensel closed this issue 2 years ago

karina-hensel commented 2 years ago

Scores on X-Stance dataset

Results


German:

| Model | Source | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| mGPT (`--num_fewshot=5`) | lm-evaluation-harness | 50.56 | 50.57 | 50.56 | 49.94 |
| fastText | paper (German set) | - | - | - | 69.9* |
| M-BERT | paper (German set) | - | - | - | 76.8* |

French:

| Model | Source | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| gpt2 | lm-evaluation-harness | 47.47 | 50.49 | 50.28 | 42.74 |
| fastText | paper (French set) | - | - | - | 71.2* |
| M-BERT | paper (French set) | - | - | - | 76.6* |

(*The F1-scores for the experiments described in the paper are the macro-averages of the F1-scores for ‘favor’ and for ‘against’; accuracy, precision, and recall are not reported in the original paper.)
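
For reference, a sketch of how the mGPT row above could be reproduced with the harness CLI, assuming the OpenGPTX fork keeps the upstream EleutherAI interface; the task name `xstance_de` and the Hugging Face model id `sberbank-ai/mGPT` are assumptions on my part, not taken from this issue:

```bash
# Hypothetical invocation: the task name and pretrained model id are assumptions.
python main.py \
    --model gpt2 \
    --model_args pretrained=sberbank-ai/mGPT \
    --tasks xstance_de \
    --num_fewshot 5
```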
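
To make the footnote's metric concrete, here is a minimal sketch of the macro-averaged F1 using scikit-learn; the stance labels and predictions below are illustrative only:

```python
# Minimal sketch: macro-averaged F1 over the two stance classes,
# as described in the footnote. Labels below are illustrative only.
from sklearn.metrics import f1_score

y_true = ["favor", "against", "against", "favor", "against"]
y_pred = ["favor", "favor", "against", "against", "against"]

# average="macro" takes the unweighted mean of the per-class F1 scores,
# so 'favor' and 'against' count equally regardless of class balance.
print(f1_score(y_true, y_pred, average="macro"))  # 0.5833...
```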