(See comments on Issue 4 for more context.)

Hi Jannis, I've recomputed the GPT-2 results from our saved model outputs, and while they do not exactly match yours, they are consistently closest to yours. It's possible some change in huggingface leads to small numerical discrepancies, though I'm not sure what exactly would cause that. Assuming that's the case, your results look correct and are a close-enough replication of our revised results.
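For anyone trying to pin down such discrepancies: the scoring boils down to comparing summed token log-probabilities of the two sentences in each minimal pair. Here is a minimal sketch with huggingface transformers, an illustration of the general criterion rather than the exact code behind any of the columns below:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    # Sum of log-probabilities GPT-2 assigns to each token given its prefix.
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]  # first output is the logits in both old and new versions
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# A minimal pair counts as correct when the acceptable sentence scores higher.
print(sentence_logprob("The cats annoy Tim.") > sentence_logprob("The cats annoys Tim."))
```

Small version-to-version changes in the library (tokenizer details, numerical defaults) could plausibly shift these log-probabilities just enough to flip a handful of near-tied pairs, which would match the sub-percent differences in the table.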
Here are your results next to our revised ones (I've bolded any case where the difference between the two is >2%):
Paradigm | Paper | Raw Data | JV's Replication | AW's Revision |
---|---|---|---|---|
adjunct_island | 91 | 91.4% | 89.5% | 89.4% |
anaphor_gender_agreement | 99 | 99.4% | 99.3% | 99.4% |
anaphor_number_agreement | 100 | 99.7% | 98.9% | 99.2% |
animate_subject_passive | 77 | 78.6% | 76.4% | 76.6% |
animate_subject_trans | 80 | 48.5% | 83.5% | 84.9% |
causative | 68 | 77.1% | 79.0% | 78.2% |
complex_NP_island | 72 | 71.8% | 71.5% | 72.2% |
coordinate_structure_constraint_complex_left_branch | 42 | 42.4% | 82.2% | 81.0% |
coordinate_structure_constraint_object_extraction | 88 | 88.3% | 84.9% | 85.4% |
determiner_noun_agreement_1 | 100 | 99.5% | 98.8% | 98.8% |
determiner_noun_agreement_2 | 93 | 93.2% | 98.0% | 97.7% |
determiner_noun_agreement_irregular_1 | 94 | 94.2% | 95.7% | 95.7% |
determiner_noun_agreement_irregular_2 | 93 | 92.7% | 95.7% | 95.3% |
determiner_noun_agreement_with_adj_2 | 96 | 90.2% | 95.6% | 94.9% |
determiner_noun_agreement_with_adj_1 | 90 | 93.1% | 97.5% | 97.5% |
determiner_noun_agreement_with_adj_irregular_1 | 88 | 95.6% | 93.0% | 92.7% |
determiner_noun_agreement_with_adj_irregular_2 | 93 | 88.2% | 94.4% | 94.0% |
distractor_agreement_relational_noun | 83 | 82.7% | 79.9% | 79.5% |
distractor_agreement_relative_clause | 68 | 68.1% | 64.8% | 65.7% |
drop_argument | 84 | 79.7% | 80.5% | 80.7% |
ellipsis_n_bar_1 | 88 | 87.5% | 91.6% | 91.5% |
ellipsis_n_bar_2 | 86 | 85.7% | 86.7% | 87.2% |
existential_there_object_raising | 92 | 91.5% | 78.8% | 78.4% |
existential_there_quantifiers_1 | 99 | 98.9% | 99.6% | 99.5% |
existential_there_quantifiers_2 | 24 | 24.4% | 42.8% | 42.4% |
existential_there_subject_raising | 89 | 88.8% | 90.0% | 91.1% |
expletive_it_object_raising | 58 | 58.0% | 79.7% | 79.2% |
inchoative | 90 | 68.3% | 65.8% | 65.9% |
intransitive | 90 | 83.8% | 83.9% | 84.2% |
irregular_past_participle_adjectives | 78 | 78.0% | 97.8% | 97.7% |
irregular_past_participle_verbs | 90 | 90.1% | 87.8% | 86.1% |
irregular_plural_subject_verb_agreement_1 | 95 | 94.5% | 92.8% | 92.8% |
irregular_plural_subject_verb_agreement_2 | 96 | 95.9% | 92.3% | 92.4% |
left_branch_island_echo_question | 77 | 77.2% | 51.0% | 52.3% |
left_branch_island_simple_question | 82 | 81.7% | **89.4%** | **87.1%** |
matrix_question_npi_licensor_present | 67 | 66.7% | **68.8%** | **65.4%** |
npi_present_1 | 55 | 54.8% | 65.6% | 64.8% |
npi_present_2 | 62 | 61.6% | 66.0% | 64.3% |
only_npi_licensor_present | 100 | 99.7% | **90.7%** | **94.5%** |
only_npi_scope | 85 | 85.4% | **74.3%** | **78.5%** |
passive_1 | 89 | 90.0% | 88.5% | 89.3% |
passive_2 | 79 | 89.8% | 89.8% | 90.2% |
principle_A_case_1 | 96 | 96.3% | 100.0% | 100.0% |
principle_A_case_2 | 73 | 72.7% | 95.5% | 94.8% |
principle_A_c_command | 100 | 100.0% | 73.2% | 73.7% |
principle_A_domain_1 | 99 | 99.3% | 98.7% | 98.4% |
principle_A_domain_2 | 73 | 73.1% | 77.1% | 77.5% |
principle_A_domain_3 | 82 | 82.2% | **71.7%** | **75.4%** |
principle_A_reconstruction | 37 | 36.9% | 47.1% | 46.8% |
regular_plural_subject_verb_agreement_1 | 97 | 96.9% | 95.7% | 96.7% |
regular_plural_subject_verb_agreement_2 | 96 | 95.6% | 91.4% | 91.0% |
sentential_negation_npi_licensor_present | 89 | 88.8% | 97.0% | 97.1% |
sentential_negation_npi_scope | 95 | 95.1% | 71.8% | 73.2% |
sentential_subject_island | 35 | 35.0% | 34.8% | 35.5% |
superlative_quantifiers_1 | 84 | 83.6% | 85.6% | 87.0% |
superlative_quantifiers_2 | 78 | 78.1% | **84.8%** | **87.2%** |
tough_vs_raising_1 | 72 | 72.0% | 71.0% | 72.0% |
tough_vs_raising_2 | 92 | 92.1% | 90.6% | 88.9% |
transitive | 49 | 88.5% | 85.9% | 86.0% |
wh_island | 77 | 77.2% | 78.8% | 78.9% |
wh_questions_object_gap | 84 | 83.6% | 83.5% | 84.2% |
wh_questions_subject_gap | 95 | 94.9% | 95.1% | 95.6% |
wh_questions_subject_gap_long_distance | 88 | 87.5% | 86.5% | 87.1% |
wh_vs_that_no_gap | 97 | 97.3% | 97.0% | 96.8% |
wh_vs_that_no_gap_long_distance | 94 | 94.4% | 93.8% | 93.8% |
wh_vs_that_with_gap | 56 | 55.9% | 54.2% | 55.1% |
wh_vs_that_with_gap_long_distance | 56 | 55.5% | 56.5% | 56.3% |
Please allow me another question, which I believe is unrelated to #4.
I was trying to replicate the GPT-2 results using the reference implementation at https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/blimp.
The notebook I used can be found here: https://colab.research.google.com/drive/1OdbuU9Fk37KicNkT7hkJKGA9BqB5t6NB?usp=sharing. I think the only deviation from the instructions was that I installed `transformers==2.0.0` to resolve a `ModuleNotFoundError`.

I then compared the results to the accuracies published in the paper and to the raw results in this repository. For most paradigms, the results were similar. However, for some paradigms there was a large difference from the paper, the raw data, or both (see the table above).
Did I perhaps make a mistake in the evaluation, or was a different version of the BLiMP dataset used originally?
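For concreteness, the paradigm accuracy I am comparing is just the fraction of minimal pairs where the model prefers the acceptable sentence. A minimal sketch (my own illustration, not the jiant script; it assumes BLiMP's `.jsonl` files with `sentence_good`/`sentence_bad` fields and a sentence scorer like the one sketched earlier):

```python
import json

def paradigm_accuracy(jsonl_path, sentence_logprob):
    # Fraction of pairs where the acceptable sentence gets the higher score.
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            pair = json.loads(line)
            correct += sentence_logprob(pair["sentence_good"]) > sentence_logprob(pair["sentence_bad"])
            total += 1
    return correct / total

# e.g. paradigm_accuracy("data/adjunct_island.jsonl", sentence_logprob)
```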