Hello!
We are getting quite close to the deadline (September 1, 11:59 PM Anywhere on Earth), so I wanted to remind you that your PR still needs some attention. Please double-check the automated tests, and don't forget to submit your accompanying paper to OpenReview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.
Good luck finalising your PR and paper, and feel free to tag us if you have questions.
Cheers,
Verna
On behalf of the GenBench team
ICL consistency test
This task tests the consistency of prompt-based model predictions across a wide range of prompt setups, calculating accuracy and consistency scores.
Authors
lucas.weber@upf.edu
elia.bruni@gmail.com
dieuwkehupkes@meta.com
Implementation
No data preprocessing is necessary. We implemented a custom `evaluate_predictions()` method to calculate accuracy and consistency scores for each setup separately.
Usage
The custom `evaluate_predictions()` method accepts inputs in the default format, with `predictions` expecting a `Dict[str, Dict[str, Any]]` and `gold` expecting a `datasets.Dataset`. For `predictions`, the keys of the outer dictionary should represent the `setup_IDs` and the keys of the inner dictionary should represent the respective `data_IDs`. For a fully implemented example evaluation pipeline using Hugging Face, see `example_evaluation.py`.
Checklist:
`genbench-cli test-task` tool.
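To illustrate the nested `predictions` format described under Usage, here is a minimal, self-contained sketch. It is not the task's actual `evaluate_predictions()` implementation: the setup and data IDs are made up, `gold` is replaced by a plain list of rows with assumed field names (`data_ID`, `target`) instead of a `datasets.Dataset`, and a simple pairwise agreement rate stands in for the consistency score, whose exact definition is not given here.

```python
from itertools import combinations
from typing import Any, Dict, List, Tuple

# Outer keys: setup_IDs; inner keys: data_IDs (all IDs here are hypothetical).
predictions: Dict[str, Dict[str, Any]] = {
    "setup_A": {"0": 1, "1": 0, "2": 1},
    "setup_B": {"0": 1, "1": 1, "2": 1},
}

# Stand-in for the gold datasets.Dataset; field names are assumptions.
gold: List[Dict[str, Any]] = [
    {"data_ID": "0", "target": 1},
    {"data_ID": "1", "target": 0},
    {"data_ID": "2", "target": 0},
]


def accuracy_per_setup(
    predictions: Dict[str, Dict[str, Any]], gold: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Fraction of correct predictions, computed separately for each setup."""
    gold_by_id = {row["data_ID"]: row["target"] for row in gold}
    return {
        setup_id: sum(preds[d] == gold_by_id[d] for d in preds) / len(preds)
        for setup_id, preds in predictions.items()
    }


def pairwise_agreement(
    predictions: Dict[str, Dict[str, Any]]
) -> Dict[Tuple[str, str], float]:
    """Share of shared data_IDs on which two setups give the same prediction."""
    scores = {}
    for a, b in combinations(sorted(predictions), 2):
        shared = predictions[a].keys() & predictions[b].keys()
        agree = sum(predictions[a][d] == predictions[b][d] for d in shared)
        scores[(a, b)] = agree / len(shared)
    return scores


print(accuracy_per_setup(predictions, gold))
print(pairwise_agreement(predictions))
```

For the real pipeline, including how the gold `datasets.Dataset` is loaded and passed in, `example_evaluation.py` remains the reference.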