(See comments on Issue 4 for more context.)

Hi Jannis, I've recomputed the GPT-2 results from our saved model outputs, and while they do not exactly match yours, they are consistently closest to yours. It's possible some change in huggingface leads to small numerical discrepancies, though I'm not sure what exactly would cause that. Assuming that's the case, your results look correct and are a close-enough replication of our revised results.
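For anyone trying to pin down such discrepancies: the scoring boils down to comparing summed token log-probabilities of the two sentences in each minimal pair. Here is a minimal sketch with huggingface transformers, an illustration of the general criterion rather than the exact code behind any of the columns below:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    # Sum of log-probabilities GPT-2 assigns to each token given its prefix.
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]  # first output is the logits in both old and new versions
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# A minimal pair counts as correct when the acceptable sentence scores higher.
print(sentence_logprob("The cats annoy Tim.") > sentence_logprob("The cats annoys Tim."))
```

Small version-to-version changes in the library (tokenizer details, numerical defaults) could plausibly shift these log-probabilities just enough to flip a handful of near-tied pairs, which would match the sub-percent differences in the table.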
Here are your results next to our revised ones (I've bolded any case where the difference between the two is >2%):
Paradigm | Paper | Raw Data | JV's Replication | AW's Revision |
---|---|---|---|---|
adjunct_island | 91 | 91.4% | 89.5% | 89.4% |
anaphor_gender_agreement | 99 | 99.4% | 99.3% | 99.4% |
anaphor_number_agreement | 100 | 99.7% | 98.9% | 99.2% |
animate_subject_passive | 77 | 78.6% | 76.4% | 76.6% |
animate_subject_trans | 80 | 48.5% | 83.5% | 84.9% |
causative | 68 | 77.1% | 79.0% | 78.2% |
complex_NP_island | 72 | 71.8% | 71.5% | 72.2% |
coordinate_structure_constraint_complex_left_branch | 42 | 42.4% | 82.2% | 81.0% |
coordinate_structure_constraint_object_extraction | 88 | 88.3% | 84.9% | 85.4% |
determiner_noun_agreement_1 | 100 | 99.5% | 98.8% | 98.8% |
determiner_noun_agreement_2 | 93 | 93.2% | 98.0% | 97.7% |
determiner_noun_agreement_irregular_1 | 94 | 94.2% | 95.7% | 95.7% |
determiner_noun_agreement_irregular_2 | 93 | 92.7% | 95.7% | 95.3% |
determiner_noun_agreement_with_adj_2 | 96 | 90.2% | 95.6% | 94.9% |
determiner_noun_agreement_with_adj_1 | 90 | 93.1% | 97.5% | 97.5% |
determiner_noun_agreement_with_adj_irregular_1 | 88 | 95.6% | 93.0% | 92.7% |
determiner_noun_agreement_with_adj_irregular_2 | 93 | 88.2% | 94.4% | 94.0% |
distractor_agreement_relational_noun | 83 | 82.7% | 79.9% | 79.5% |
distractor_agreement_relative_clause | 68 | 68.1% | 64.8% | 65.7% |
drop_argument | 84 | 79.7% | 80.5% | 80.7% |
ellipsis_n_bar_1 | 88 | 87.5% | 91.6% | 91.5% |
ellipsis_n_bar_2 | 86 | 85.7% | 86.7% | 87.2% |
existential_there_object_raising | 92 | 91.5% | 78.8% | 78.4% |
existential_there_quantifiers_1 | 99 | 98.9% | 99.6% | 99.5% |
existential_there_quantifiers_2 | 24 | 24.4% | 42.8% | 42.4% |
existential_there_subject_raising | 89 | 88.8% | 90.0% | 91.1% |
expletive_it_object_raising | 58 | 58.0% | 79.7% | 79.2% |
inchoative | 90 | 68.3% | 65.8% | 65.9% |
intransitive | 90 | 83.8% | 83.9% | 84.2% |
irregular_past_participle_adjectives | 78 | 78.0% | 97.8% | 97.7% |
irregular_past_participle_verbs | 90 | 90.1% | 87.8% | 86.1% |
irregular_plural_subject_verb_agreement_1 | 95 | 94.5% | 92.8% | 92.8% |
irregular_plural_subject_verb_agreement_2 | 96 | 95.9% | 92.3% | 92.4% |
left_branch_island_echo_question | 77 | 77.2% | 51.0% | 52.3% |
left_branch_island_simple_question | 82 | 81.7% | **89.4%** | **87.1%** |
matrix_question_npi_licensor_present | 67 | 66.7% | **68.8%** | **65.4%** |
npi_present_1 | 55 | 54.8% | 65.6% | 64.8% |
npi_present_2 | 62 | 61.6% | 66.0% | 64.3% |
only_npi_licensor_present | 100 | 99.7% | **90.7%** | **94.5%** |
only_npi_scope | 85 | 85.4% | **74.3%** | **78.5%** |
passive_1 | 89 | 90.0% | 88.5% | 89.3% |
passive_2 | 79 | 89.8% | 89.8% | 90.2% |
principle_A_case_1 | 96 | 96.3% | 100.0% | 100.0% |
principle_A_case_2 | 73 | 72.7% | 95.5% | 94.8% |
principle_A_c_command | 100 | 100.0% | 73.2% | 73.7% |
principle_A_domain_1 | 99 | 99.3% | 98.7% | 98.4% |
principle_A_domain_2 | 73 | 73.1% | 77.1% | 77.5% |
principle_A_domain_3 | 82 | 82.2% | **71.7%** | **75.4%** |
principle_A_reconstruction | 37 | 36.9% | 47.1% | 46.8% |
regular_plural_subject_verb_agreement_1 | 97 | 96.9% | 95.7% | 96.7% |
regular_plural_subject_verb_agreement_2 | 96 | 95.6% | 91.4% | 91.0% |
sentential_negation_npi_licensor_present | 89 | 88.8% | 97.0% | 97.1% |
sentential_negation_npi_scope | 95 | 95.1% | 71.8% | 73.2% |
sentential_subject_island | 35 | 35.0% | 34.8% | 35.5% |
superlative_quantifiers_1 | 84 | 83.6% | 85.6% | 87.0% |
superlative_quantifiers_2 | 78 | 78.1% | **84.8%** | **87.2%** |
tough_vs_raising_1 | 72 | 72.0% | 71.0% | 72.0% |
tough_vs_raising_2 | 92 | 92.1% | 90.6% | 88.9% |
transitive | 49 | 88.5% | 85.9% | 86.0% |
wh_island | 77 | 77.2% | 78.8% | 78.9% |
wh_questions_object_gap | 84 | 83.6% | 83.5% | 84.2% |
wh_questions_subject_gap | 95 | 94.9% | 95.1% | 95.6% |
wh_questions_subject_gap_long_distance | 88 | 87.5% | 86.5% | 87.1% |
wh_vs_that_no_gap | 97 | 97.3% | 97.0% | 96.8% |
wh_vs_that_no_gap_long_distance | 94 | 94.4% | 93.8% | 93.8% |
wh_vs_that_with_gap | 56 | 55.9% | 54.2% | 55.1% |
wh_vs_that_with_gap_long_distance | 56 | 55.5% | 56.5% | 56.3% |
Please allow me another question, which I believe is unrelated to #4.
I was trying to replicate the GPT-2 results using the reference implementation at https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/blimp.
The notebook I used can be found here: https://colab.research.google.com/drive/1OdbuU9Fk37KicNkT7hkJKGA9BqB5t6NB?usp=sharing. I think the only deviation from the instructions was that I installed `transformers==2.0.0` to resolve a `ModuleNotFoundError`.

I then compared the results to the accuracies published in the paper and to the raw results in this repository. For most paradigms, the results were similar. However, for some paradigms there was a large difference from the paper, the raw data, or both (see the table above).
Did I perhaps make a mistake in the evaluation, or was a different version of the BLiMP dataset used originally?
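For concreteness, the paradigm accuracy I am comparing is just the fraction of minimal pairs where the model prefers the acceptable sentence. A minimal sketch (my own illustration, not the jiant script; it assumes BLiMP's `.jsonl` files with `sentence_good`/`sentence_bad` fields and a sentence scorer like the one sketched earlier):

```python
import json

def paradigm_accuracy(jsonl_path, sentence_logprob):
    # Fraction of pairs where the acceptable sentence gets the higher score.
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            pair = json.loads(line)
            correct += sentence_logprob(pair["sentence_good"]) > sentence_logprob(pair["sentence_bad"])
            total += 1
    return correct / total

# e.g. paradigm_accuracy("data/adjunct_island.jsonl", sentence_logprob)
```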