A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed at the Samwald research group: https://samwald.info/
Some datasets contain examples with either 4 or 5 answer choices. It appears that one of the answer choices was simply duplicated so that every example always has 5 choices.
The evaluation script does not account for this case.
Since we put letters in front of the choices (A, B, C, D, E), the model can also answer with a letter. But if the correct choice appears in two places, it has two letters, and matching on a single letter can produce incorrect evaluation scores.
The first example is commonsense_qa, but there might be others.
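A minimal sketch of the failure mode and one possible fix. The function names (`score_by_letter`, `score_by_text`) and the example item are hypothetical and do not come from the actual evaluation script; the idea is to map the predicted letter back to its choice text before comparing, so every letter that points at the gold text counts as correct.

```python
import string

# A 4-choice example padded to 5 choices by duplicating the correct answer.
choices = ["bank", "library", "department store", "mall", "bank"]
gold = "bank"  # appears at both A and E

letters = list(string.ascii_uppercase[:len(choices)])  # ["A", ..., "E"]

def score_by_letter(prediction: str) -> bool:
    """Naive letter matching: only the FIRST occurrence of the gold text
    gets credit, so a model answering "E" is scored wrong even though
    choice E is also "bank"."""
    gold_letter = letters[choices.index(gold)]  # index() returns the first match
    return prediction.strip().upper() == gold_letter

def score_by_text(prediction: str) -> bool:
    """Duplicate-aware scoring: resolve a letter answer to its choice text,
    then compare texts, so any letter mapping to the gold text is correct."""
    pred = prediction.strip().upper()
    if pred in letters:
        pred = choices[letters.index(pred)]
    return pred.lower() == gold.lower()

print(score_by_letter("E"))  # False -- wrongly penalized
print(score_by_text("E"))    # True
```

Alternatively, deduplicating the choices before assigning letters would avoid the problem at the source, but that changes the prompt format (4 vs. 5 options), so a text-level comparison in the scorer may be the less invasive fix.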