allenai / qasper-led-baseline


The "anwser" for some examples is confusing #27

Closed Zcchill closed 2 months ago

Zcchill commented 2 months ago

Ref: THUDM/LongBench#67. The LongBench dataset contains a "qasper" sub-dataset, but I found that the "answers" of several examples are confusing. I want to know whether this is a misunderstanding on my part or an issue with the data annotation.

{"pred": "No", "answers": ["Yes", "No"], "all_classes": null, "length": 2317, "input": "Does this method help in sentiment classification task improvement?", "_id": "bcfe56efad9715cc714ffd2e523eaa9ad796a453e7da77a6"}
{"pred": "unanswerable", "answers": ["Yes", "Unanswerable"], "all_classes": null, "length": 2284, "actual_length": 3533, "input": "Is jiant compatible with models in any programming language?", "_id": "e5d1d589ddb30f43547012f04b06ac2924a1f4fdcf56daab"}
{"pred": "BERTBase", "answers": ["BERTbase", "BERTbase"], "all_classes": null, "length": 3852, "actual_length": 5701, "input": "What BERT model do they test?", "_id": "2a51c07e65a9214ed2cd3c04303afa205e005f4e1ccb172a"}
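For reference, here is a minimal sketch of how I surfaced these cases, assuming the records are stored one JSON object per line in a hypothetical file named `qasper_preds.jsonl` (the file name and layout are assumptions, not part of the dataset):

```python
import json

# Hypothetical file; each line is one record like the examples above.
with open("qasper_preds.jsonl") as f:
    for line in f:
        record = json.loads(line)
        answers = record["answers"]
        # Flag examples whose reference answers are not all identical,
        # i.e. the annotators gave conflicting answers.
        if len(set(answers)) > 1:
            print(record["_id"], record["input"], answers)
```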

pdasigi commented 2 months ago

@Zcchill Can you elaborate on what is confusing about the answers? If it is that there are multiple answers that sometimes contradict each other, that is because the annotators do not always agree with each other, which is expected in a difficult task requiring expert knowledge. The disagreements are quantified in the paper associated with the dataset. Also, the prescribed evaluation method is to consider a prediction correct if it matches any of the ground-truth answers.
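To illustrate that evaluation rule, here is a minimal sketch assuming a simple normalized exact-match comparison (the official Qasper metric is token-level F1 taken as the maximum over references, but it follows the same "best over all gold answers" pattern):

```python
def normalize(text: str) -> str:
    # Lowercase and strip surrounding whitespace; a real scorer would also
    # normalize punctuation and articles before comparing.
    return text.strip().lower()

def is_correct(prediction: str, gold_answers: list[str]) -> bool:
    # A prediction counts as correct if it matches ANY of the gold answers.
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)

# Example with the first record above: "No" matches one of ["Yes", "No"].
print(is_correct("No", ["Yes", "No"]))  # True
```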

Zcchill commented 2 months ago

I see. Thanks for getting back to me!