alexwarstadt / blimp

The Benchmark of Linguistic Minimal Pairs

Replication of GPT-2 Results #5

Closed jvamvas closed 3 years ago

jvamvas commented 3 years ago

Please allow another question, which I believe is unrelated to #4.

I was trying to replicate the GPT-2 results using the reference implementation at https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/blimp.

The notebook I used can be found here: https://colab.research.google.com/drive/1OdbuU9Fk37KicNkT7hkJKGA9BqB5t6NB?usp=sharing. I think the only deviation from the instructions was that I installed transformers==2.0.0 to resolve a ModuleNotFoundError.

I then compared the results to the accuracies published in the paper and to the raw results in this repository. For most paradigms, the results were similar. However, for some paradigms there was a large difference from the paper, the raw data, or both (see data below).

Did I maybe make a mistake in the evaluation, or has a different version of the BLiMP dataset been used originally?

Paradigm | Paper | Raw Data | My Replication
-- | -- | -- | --
adjunct_island | 91 | 91.4% | 89.5%
anaphor_gender_agreement | 99 | 99.4% | 99.3%
anaphor_number_agreement | 100 | 99.7% | 98.9%
animate_subject_passive | 77 | 78.6% | 76.4%
animate_subject_trans | 80 | 48.5% | 83.5%
causative | 68 | 77.1% | 79.0%
complex_NP__island | 72 | 71.8% | 71.5%
**coordinate_structure_constraint_complex_left_branch** | 42 | 42.4% | 82.2%
coordinate_structure_constraint_object_extraction | 88 | 88.3% | 84.9%
determiner_noun_agreement_1 | 100 | 99.5% | 98.8%
determiner_noun_agreement_2 | 93 | 93.2% | 98.0%
determiner_noun_agreement_irregular_1 | 94 | 94.2% | 95.7%
determiner_noun_agreement_irregular_2 | 93 | 92.7% | 95.7%
determiner_noun_agreement_with_adj_2 | 96 | 90.2% | 95.6%
determiner_noun_agreement_with_adj_1 | 90 | 93.1% | 97.5%
determiner_noun_agreement_with_adj_irregular_1 | 88 | 95.6% | 93.0%
determiner_noun_agreement_with_adj_irregular_2 | 93 | 88.2% | 94.4%
distractor_agreement_relational_noun | 83 | 82.7% | 79.9%
distractor_agreement_relative_clause | 68 | 68.1% | 64.8%
drop_argument | 84 | 79.7% | 80.5%
ellipsis_n_bar_1 | 88 | 87.5% | 91.6%
ellipsis_n_bar_2 | 86 | 85.7% | 86.7%
**existential_there_object_raising** | 92 | 91.5% | 78.8%
existential_there_quantifiers_1 | 99 | 98.9% | 99.6%
**existential_there_quantifiers_2** | 24 | 24.4% | 42.8%
existential_there_subject_raising | 89 | 88.8% | 90.0%
**expletive_it_object_raising** | 58 | 58.0% | 79.7%
inchoative | 90 | 68.3% | 65.8%
intransitive | 90 | 83.8% | 83.9%
**irregular_past_participle_adjectives** | 78 | 78.0% | 97.8%
irregular_past_participle_verbs | 90 | 90.1% | 87.8%
irregular_plural_subject_verb_agreement_1 | 95 | 94.5% | 92.8%
irregular_plural_subject_verb_agreement_2 | 96 | 95.9% | 92.3%
**left_branch_island_echo_question** | 77 | 77.2% | 51.0%
left_branch_island_simple_question | 82 | 81.7% | 89.4%
matrix_question_npi_licensor_present | 67 | 66.7% | 68.8%
npi_present_1 | 55 | 54.8% | 65.6%
npi_present_2 | 62 | 61.6% | 66.0%
only_npi_licensor_present | 100 | 99.7% | 90.7%
**only_npi_scope** | 85 | 85.4% | 74.3%
passive_1 | 89 | 90.0% | 88.5%
passive_2 | 79 | 89.8% | 89.8%
principle_A_case_1 | 96 | 96.3% | 100.0%
**principle_A_case_2** | 73 | 72.7% | 95.5%
**principle_A_c_command** | 100 | 100.0% | 73.2%
principle_A_domain_1 | 99 | 99.3% | 98.7%
principle_A_domain_2 | 73 | 73.1% | 77.1%
**principle_A_domain_3** | 82 | 82.2% | 71.7%
**principle_A_reconstruction** | 37 | 36.9% | 47.1%
regular_plural_subject_verb_agreement_1 | 97 | 96.9% | 95.7%
regular_plural_subject_verb_agreement_2 | 96 | 95.6% | 91.4%
sentential_negation_npi_licensor_present | 89 | 88.8% | 97.0%
**sentential_negation_npi_scope** | 95 | 95.1% | 71.8%
sentential_subject_island | 35 | 35.0% | 34.8%
superlative_quantifiers_1 | 84 | 83.6% | 85.6%
superlative_quantifiers_2 | 78 | 78.1% | 84.8%
tough_vs_raising_1 | 72 | 72.0% | 71.0%
tough_vs_raising_2 | 92 | 92.1% | 90.6%
transitive | 49 | 88.5% | 85.9%
wh_island | 77 | 77.2% | 78.8%
wh_questions_object_gap | 84 | 83.6% | 83.5%
wh_questions_subject_gap | 95 | 94.9% | 95.1%
wh_questions_subject_gap_long_distance | 88 | 87.5% | 86.5%
wh_vs_that_no_gap | 97 | 97.3% | 97.0%
wh_vs_that_no_gap_long_distance | 94 | 94.4% | 93.8%
wh_vs_that_with_gap | 56 | 55.9% | 54.2%
wh_vs_that_with_gap_long_distance | 56 | 55.5% | 56.5%

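
For readers checking their own evaluation, the per-paradigm accuracy in the table is the share of minimal pairs for which the model prefers the acceptable sentence. The following is only a minimal sketch of that metric, not the jiant reference script: `minimal_pair_accuracy` and `toy_score` are hypothetical names, and the toy scorer stands in for a real language-model log-probability so that the example runs without downloading GPT-2.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (good, bad) pairs where the good sentence scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model log-probability: shorter sentences score higher."""
    return -float(len(sentence.split()))


# Two made-up pairs, each given as (sentence_good, sentence_bad).
example_pairs = [
    ("a b", "a b c"),      # toy scorer prefers the good sentence
    ("a b c d", "a b"),    # toy scorer prefers the bad sentence
]
print(minimal_pair_accuracy(example_pairs, toy_score))
```

With a real model, `score` would return the summed token log-probability of the sentence. Differences in tokenization or library version at that step are one place where small numerical discrepancies could enter.
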
alexwarstadt commented 3 years ago

(See comments on Issue 4 for more context.) Hi Jannis, I've recomputed the results for GPT-2 from our saved model outputs. They do not exactly match yours, but they are consistently closer to yours than to the paper or the raw data. It's possible that some change on huggingface leads to small numerical discrepancies, but I'm unsure what could cause that. Assuming that's the case, it looks like your results are probably correct and a close-enough replication of our revised results.

Here are your results next to our revised results (I've bolded any case where the difference is >2%):

Paradigm | Paper | Raw Data | JV's Replication | AW's Revision
-- | -- | -- | -- | --
adjunct_island | 91 | 91.4% | 89.5% | 89.4%
anaphor_gender_agreement | 99 | 99.4% | 99.3% | 99.4%
anaphor_number_agreement | 100 | 99.7% | 98.9% | 99.2%
animate_subject_passive | 77 | 78.6% | 76.4% | 76.6%
animate_subject_trans | 80 | 48.5% | 83.5% | 84.9%
causative | 68 | 77.1% | 79.0% | 78.2%
complex_NP__island | 72 | 71.8% | 71.5% | 72.2%
coordinate_structure_constraint_complex_left_branch | 42 | 42.4% | 82.2% | 81.0%
coordinate_structure_constraint_object_extraction | 88 | 88.3% | 84.9% | 85.4%
determiner_noun_agreement_1 | 100 | 99.5% | 98.8% | 98.8%
determiner_noun_agreement_2 | 93 | 93.2% | 98.0% | 97.7%
determiner_noun_agreement_irregular_1 | 94 | 94.2% | 95.7% | 95.7%
determiner_noun_agreement_irregular_2 | 93 | 92.7% | 95.7% | 95.3%
determiner_noun_agreement_with_adj_2 | 96 | 90.2% | 95.6% | 94.9%
determiner_noun_agreement_with_adj_1 | 90 | 93.1% | 97.5% | 97.5%
determiner_noun_agreement_with_adj_irregular_1 | 88 | 95.6% | 93.0% | 92.7%
determiner_noun_agreement_with_adj_irregular_2 | 93 | 88.2% | 94.4% | 94.0%
distractor_agreement_relational_noun | 83 | 82.7% | 79.9% | 79.5%
distractor_agreement_relative_clause | 68 | 68.1% | 64.8% | 65.7%
drop_argument | 84 | 79.7% | 80.5% | 80.7%
ellipsis_n_bar_1 | 88 | 87.5% | 91.6% | 91.5%
ellipsis_n_bar_2 | 86 | 85.7% | 86.7% | 87.2%
existential_there_object_raising | 92 | 91.5% | 78.8% | 78.4%
existential_there_quantifiers_1 | 99 | 98.9% | 99.6% | 99.5%
existential_there_quantifiers_2 | 24 | 24.4% | 42.8% | 42.4%
existential_there_subject_raising | 89 | 88.8% | 90.0% | 91.1%
expletive_it_object_raising | 58 | 58.0% | 79.7% | 79.2%
inchoative | 90 | 68.3% | 65.8% | 65.9%
intransitive | 90 | 83.8% | 83.9% | 84.2%
irregular_past_participle_adjectives | 78 | 78.0% | 97.8% | 97.7%
irregular_past_participle_verbs | 90 | 90.1% | 87.8% | 86.1%
irregular_plural_subject_verb_agreement_1 | 95 | 94.5% | 92.8% | 92.8%
irregular_plural_subject_verb_agreement_2 | 96 | 95.9% | 92.3% | 92.4%
left_branch_island_echo_question | 77 | 77.2% | 51.0% | 52.3%
left_branch_island_simple_question | 82 | 81.7% | 89.4% | 87.1%
matrix_question_npi_licensor_present | 67 | 66.7% | 68.8% | 65.4%
npi_present_1 | 55 | 54.8% | 65.6% | 64.8%
npi_present_2 | 62 | 61.6% | 66.0% | 64.3%
only_npi_licensor_present | 100 | 99.7% | 90.7% | 94.5%
only_npi_scope | 85 | 85.4% | 74.3% | 78.5%
passive_1 | 89 | 90.0% | 88.5% | 89.3%
passive_2 | 79 | 89.8% | 89.8% | 90.2%
principle_A_case_1 | 96 | 96.3% | 100.0% | 100.0%
principle_A_case_2 | 73 | 72.7% | 95.5% | 94.8%
principle_A_c_command | 100 | 100.0% | 73.2% | 73.7%
principle_A_domain_1 | 99 | 99.3% | 98.7% | 98.4%
principle_A_domain_2 | 73 | 73.1% | 77.1% | 77.5%
principle_A_domain_3 | 82 | 82.2% | 71.7% | 75.4%
principle_A_reconstruction | 37 | 36.9% | 47.1% | 46.8%
regular_plural_subject_verb_agreement_1 | 97 | 96.9% | 95.7% | 96.7%
regular_plural_subject_verb_agreement_2 | 96 | 95.6% | 91.4% | 91.0%
sentential_negation_npi_licensor_present | 89 | 88.8% | 97.0% | 97.1%
sentential_negation_npi_scope | 95 | 95.1% | 71.8% | 73.2%
sentential_subject_island | 35 | 35.0% | 34.8% | 35.5%
superlative_quantifiers_1 | 84 | 83.6% | 85.6% | 87.0%
superlative_quantifiers_2 | 78 | 78.1% | 84.8% | 87.2%
tough_vs_raising_1 | 72 | 72.0% | 71.0% | 72.0%
tough_vs_raising_2 | 92 | 92.1% | 90.6% | 88.9%
transitive | 49 | 88.5% | 85.9% | 86.0%
wh_island | 77 | 77.2% | 78.8% | 78.9%
wh_questions_object_gap | 84 | 83.6% | 83.5% | 84.2%
wh_questions_subject_gap | 95 | 94.9% | 95.1% | 95.6%
wh_questions_subject_gap_long_distance | 88 | 87.5% | 86.5% | 87.1%
wh_vs_that_no_gap | 97 | 97.3% | 97.0% | 96.8%
wh_vs_that_no_gap_long_distance | 94 | 94.4% | 93.8% | 93.8%
wh_vs_that_with_gap | 56 | 55.9% | 54.2% | 55.1%
wh_vs_that_with_gap_long_distance | 56 | 55.5% | 56.5% | 56.3%
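
The ">2%" check mentioned above can be reproduced with a few lines of Python. This is a rough sketch that assumes the difference is taken between the JV's Replication and AW's Revision columns, which the thread does not state explicitly. `ROWS` holds only four rows copied from the table, and `flag_large_differences` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# (JV's Replication, AW's Revision) accuracies in percent,
# copied from four rows of the table above.
ROWS: Dict[str, Tuple[float, float]] = {
    "adjunct_island": (89.5, 89.4),
    "only_npi_scope": (74.3, 78.5),
    "principle_A_domain_3": (71.7, 75.4),
    "wh_island": (78.8, 78.9),
}


def flag_large_differences(
    rows: Dict[str, Tuple[float, float]],
    threshold: float = 2.0,
) -> List[str]:
    """Return paradigms whose two accuracies differ by more than `threshold` points."""
    return [
        name
        for name, (replication, revision) in rows.items()
        if abs(replication - revision) > threshold
    ]


print(flag_large_differences(ROWS))
# prints ['only_npi_scope', 'principle_A_domain_3']
```

Extending `ROWS` to all 67 paradigms, or changing the compared columns, only requires changing the dictionary.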