ICL consistency test

This task tests the consistency of prompt-based model predictions across a wide range of prompt setups, reporting both accuracy and consistency scores.

Authors

Lucas Weber lucas.weber@upf.edu
Elia Bruni elia.bruni@gmail.com
Dieuwke Hupkes dieuwkehupkes@meta.com

Implementation

No data preprocessing is necessary. We implemented a custom evaluate_predictions() method that calculates accuracy and consistency scores for each setup separately.

Usage

The custom evaluate_predictions() method accepts inputs in the default format: predictions is expected to be a Dict[str, Dict[str, Any]] and gold a datasets.Dataset. In predictions, the keys of the outer dictionary are the setup_IDs and the keys of each inner dictionary are the corresponding data_IDs. For a fully implemented example evaluation pipeline using Hugging Face, see example_evaluation.py.
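
To illustrate the nested shape of predictions described above, here is a minimal sketch. It is not the task's evaluate_predictions() implementation: the setup and data IDs, the `data_ID` and `target` field names, the plain list standing in for a datasets.Dataset, and the simple pairwise agreement rate used as a consistency stand-in are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Any, Dict, List, Tuple

# Hypothetical predictions: outer keys are setup_IDs, inner keys are data_IDs.
predictions: Dict[str, Dict[str, Any]] = {
    "setup_A": {"0": 1, "1": 0, "2": 1},
    "setup_B": {"0": 1, "1": 1, "2": 1},
}

# Stand-in for the gold datasets.Dataset; the field names are assumptions.
gold: List[Dict[str, Any]] = [
    {"data_ID": "0", "target": 1},
    {"data_ID": "1", "target": 0},
    {"data_ID": "2", "target": 0},
]


def accuracy_per_setup(
    predictions: Dict[str, Dict[str, Any]], gold: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Fraction of correct predictions, computed separately for each setup."""
    targets = {row["data_ID"]: row["target"] for row in gold}
    return {
        setup_id: sum(preds[d] == targets[d] for d in preds) / len(preds)
        for setup_id, preds in predictions.items()
    }


def pairwise_agreement(
    predictions: Dict[str, Dict[str, Any]]
) -> Dict[Tuple[str, str], float]:
    """Share of data points on which two setups give the same prediction.

    This is an illustrative consistency proxy, not the task's metric.
    """
    result = {}
    for a, b in combinations(sorted(predictions), 2):
        shared = predictions[a].keys() & predictions[b].keys()
        same = sum(predictions[a][d] == predictions[b][d] for d in shared)
        result[(a, b)] = same / len(shared)
    return result


accuracies = accuracy_per_setup(predictions, gold)
agreement = pairwise_agreement(predictions)
```

Keying the outer dictionary by setup keeps each setup's predictions separable, so per-setup scores and cross-setup comparisons can both be read off the same structure.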

Checklist:

[x] I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
[x] Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
[x] I have read the description of what should be in the doc.md of my task, and have added the required arguments.
[x] I have submitted or will submit an accompanying paper to the GenBench workshop.

[Task Submission] ICL consistency test (`icl_consistency_test`) #11
