There are a few points to note for some datasets:
For TriviaQA, NQ, and WebQuestions, we are evaluating on the open-domain variant of each task and should use the evaluation procedure shown here: https://github.com/google-research/google-research/tree/master/t5_closed_book_qa
Posting a colab that calls the special eval scripts for the above datasets.
https://colab.research.google.com/drive/1G2zxbvi96qxbOv6LNYvTTrcJxsBX_4Hr
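For context, the core of that open-domain eval is SQuAD-style exact match against a set of gold answers: normalize both sides (lowercase, strip punctuation and articles) and count a hit if the prediction matches any reference. A minimal sketch of that check, not the linked code itself:

```python
import re
import string


def normalize_answer(text):
    # SQuAD-style normalization: lowercase, drop punctuation and articles,
    # collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    # Count a hit if the normalized prediction matches any normalized gold answer.
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))
```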
Edit:
Dataset splits for open-domain QA are a total mess, so I'm writing this down here for my own records:
Leo Gao mentions that
for ARC, OpenBookQA, and RACE in particular, OpenAI claims that a different kind of normalization described in the GPT-3 paper works really well (they don't provide any evidence or explanation; they just assert that it does and use it for only these three tasks).
We aren't doing any kind of length normalization, so if we are underperforming on those tasks, we could consider it.
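For reference, the normalization GPT-3 describes for those three tasks divides each candidate's likelihood by its unconditional likelihood given the string "Answer: ", rather than normalizing by length. A rough sketch of how that could slot into rank eval; `score_logprob` is a hypothetical stand-in for whatever log-likelihood call our harness exposes:

```python
def rank_with_unconditional_norm(context, options, score_logprob):
    # Score each option by logP(option | context) - logP(option | "Answer: "),
    # the normalization GPT-3 reports using for ARC, OpenBookQA, and RACE.
    scores = [
        score_logprob(context, option) - score_logprob("Answer: ", option)
        for option in options
    ]
    # Return the index of the highest-scoring option.
    return max(range(len(options)), key=lambda i: scores[i])
```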
For DROP, I noticed that the model would sometimes predict a number as a word instead of as digits, so I added a word-to-digit step to the normalization process.
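The added step is just a word-to-digit substitution applied before the usual DROP answer normalization. Roughly this (a sketch rather than the exact code; the mapping only needs to cover the small numbers the model tends to spell out):

```python
# Hypothetical mapping applied before the standard DROP normalization.
_WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}


def words_to_digits(text):
    # Replace spelled-out number words with digits, token by token.
    return " ".join(_WORD_TO_DIGIT.get(tok.lower(), tok) for tok in text.split())
```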
Predictions from drop_can_you_tell_me_1112200_predictions (model finetune-t5-xxl-lm-d4-091621-512):

Before:
Overall: Exact-match 26.51, F1 30.32
date, 152 (1.59%): Exact-match 64.474, F1 71.829
number, 5826 (61.09%): Exact-match 8.393, F1 8.937
span, 3069 (32.18%): Exact-match 63.245, F1 69.709
spans, 489 (5.13%): Exact-match 0.000, F1 24.871

After:
Overall: Exact-match 31.50, F1 35.29
date, 152 (1.59%): Exact-match 64.474, F1 71.829
number, 5837 (61.21%): Exact-match 16.567, F1 17.110
span, 3060 (32.09%): Exact-match 63.366, F1 69.805
spans, 487 (5.11%): Exact-match 0.000, F1 24.903
According to the GPT-3 paper, zero-shot DROP F1 on the dev split by model size:
Small 9.40, Med 13.6, Large 14.4, XL 16.4, 2.7B 19.7, 6.7B 17.0, 13B 24.0, 175B 23.6
So we are performing significantly better than GPT-3 zero-shot.
For the record, each example in CoQA and QuAC is actually N examples, where N is the number of turns in the dialog. Our prompts for CoQA and QuAC only evaluate one turn per example. We need to create new serialized versions of the datasets if we are going to evaluate on the full dataset.
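The re-serialization itself is mechanical: a dialog with N turns becomes N examples, where example i carries the passage plus the first i question/answer pairs as history and targets answer i. A sketch, assuming CoQA-style fields (`story`, `questions`, `answers['input_text']`), matching what our current prompts read:

```python
def expand_dialog(example):
    # Turn one CoQA-style dialog into one example per turn.
    questions = example["questions"]
    answers = example["answers"]["input_text"]
    expanded = []
    for turn in range(len(questions)):
        expanded.append({
            "story": example["story"],
            # Prior turns become part of the context for this turn.
            "history": list(zip(questions[:turn], answers[:turn])),
            "question": questions[turn],
            "answer": answers[turn],
        })
    return expanded
```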
Natural Questions: In H.1, the GPT-3 paper says it is reporting results on the "test" split. However, standard open-domain and closed-book QA practice is to use the validation set as the test set. I'm guessing that's what they mean, but I'm waiting for confirmation.
Got confirmation that this is indeed the case. Unfortunately, this split is not available in HF, apart from in the main nq dataset, which is utterly colossal. The one in the nq_open dataset (and the one in kilt nq) is different.
@craffel I've actually made prompts for CoQA that try to solve this.
{% set n=25 %}{# evaluate questions[n]/answers[n], using the preceding n turns as context #}
{% if questions|length > n %}
{{story}}
Q: {{questions[0]}}
{% for i in range(0,n) %}
A: {{answers['input_text'][i]}}
Q: {{questions[i+1]}}
{% endfor %}
A:
|||
{{answers['input_text'][n]}}
{% else %}
Placeholder, Do Not Process
|||
Placeholder, Do Not Process
{% endif %}
But the downside is that it has to be a unique prompt for each number of turns. For CoQA the maximum number of turns is 25, so there need to be 25 unique prompts. The idea is then to collect the predictions into a JSON file and run the official eval script.
So far I've made around 15 unique prompts (each just changes the number). I can make a pull request if this approach makes sense.
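Since the variants only differ in the value of n, the rest can be generated instead of written by hand. A quick sketch, assuming the n=25 template above is saved to a file (the filename is just for illustration):

```python
# Read the n=25 template shown above (hypothetical filename).
with open("coqa_turn_25.jinja") as f:
    base = f.read()

# Produce one template per turn count by swapping the value of n
# (adjust the range as needed).
templates = {
    n: base.replace("{% set n=25 %}", f"{{% set n={n} %}}")
    for n in range(1, 26)
}
```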
Thanks @lintangsutawika . @zaidalyafeai actually made HFDS variants of the tasks that include prior dialog turns as context. I think we can just use that as the base dataset.
I wrote ten GPT-3-style ReCoRD prompts, where the model has to rank every version of the query sentence with each possible entity filled in. They will only make sense for rank eval. I can try to run eval on them before we cache. Not sure if it will help, but worth a try. https://github.com/bigscience-workshop/promptsource/pull/490
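The scoring these prompts imply is ordinary rank classification: fill each candidate entity into the query's @placeholder slot, score every completed sentence with the model, and predict the highest-scoring entity. A sketch; `score_fn` stands in for whatever conditional log-likelihood call the eval harness provides:

```python
def record_rank_eval(passage, query, entities, score_fn):
    # Fill each candidate entity into the ReCoRD query and rank the completions.
    candidates = [query.replace("@placeholder", entity) for entity in entities]
    scores = [score_fn(passage, candidate) for candidate in candidates]
    best = max(range(len(entities)), key=lambda i: scores[i])
    return entities[best]
```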
Datasets that are mostly done but we need to re-run eval and/or compute scores manually for all models:
Datasets where there is still work to be done:
Datasets I don't know the status of:
Since we have the string outputs of all tasks, in principle we should be able to run arbitrary metrics, especially for datasets that require fancy ones. @lintangsutawika has imported the official eval scripts for ReCoRD, SQuAD v2, Natural Questions, TriviaQA, and DROP.
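Concretely, that just means reloading the cached prediction/target strings and handing them to whichever official scorer applies. A minimal sketch, assuming the cached outputs are one string per line in parallel files (the real on-disk format may differ):

```python
def rescore(predictions_path, targets_path, metric_fn):
    # Apply an arbitrary metric function to cached string outputs.
    with open(predictions_path) as f:
        predictions = [line.rstrip("\n") for line in f]
    with open(targets_path) as f:
        targets = [line.rstrip("\n") for line in f]
    assert len(predictions) == len(targets)
    return metric_fn(targets, predictions)
```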
Update: Even when using Lintang's eval scripts, all extractive QAs and closed-book (generative) QAs still have abnormally low numbers, namely:
I also think eval of the extractive QA tasks from the training mixture failed.
(Note that ARC is closed-book, but its performance is fine because it's multiple-choice. A great case in point that machine task categories depend far more on format than on human skill/knowledge.)
Others with issues to keep an eye on: