adobe-research / deft_corpus

The Definition Extraction From Text corpus and relevant formatting scripts

Missing relations #40

Open davletov-aa opened 4 years ago

davletov-aa commented 4 years ago

I found 266 examples (context windows) in the train and dev sets that contain tokens with root_id "0" and some tag_id, say TXXX, but no tokens whose root_id is TXXX.

For example, the T105 tokens in this window:

data/source_txt/t3_physics_2_101.deft

TOKEN        ROOT_ID  TAG_ID  RELATION
3161         -1       -1      0
.            -1       -1      0
Another      -1       -1      0
is           -1       -1      0
what         -1       -1      0
Democritus   -1       -1      0
in           -1       -1      0
particular   -1       -1      0
believed     -1       -1      0
—            -1       -1      0
that         -1       -1      0
there        0        T106    0
is           0        T106    0
a            0        T106    0
smallest     0        T106    0
unit         0        T106    0
that         0        T106    0
can          0        T106    0
not          0        T106    0
be           0        T106    0
further      0        T106    0
subdivided   0        T106    0
.            -1       -1      0
Democritus   -1       -1      0
called       -1       -1      0
this         T106     T194    Refers-To
the          0        T105    0
atom         0        T105    0
.            -1       -1      0
We           -1       -1      0
now          -1       -1      0
know         -1       -1      0
that         -1       -1      0
atoms        -1       -1      0
themselves   -1       -1      0
can          -1       -1      0
be           -1       -1      0
subdivided   -1       -1      0
,            -1       -1      0
but          -1       -1      0
their        -1       -1      0
identity     -1       -1      0
is           -1       -1      0
destroyed    -1       -1      0
in           -1       -1      0
the          -1       -1      0
process      -1       -1      0
,            -1       -1      0
so           -1       -1      0
the          -1       -1      0
Greeks       -1       -1      0
were         -1       -1      0
correct      -1       -1      0
in           -1       -1      0
a            -1       -1      0
respect      -1       -1      0
.            -1       -1      0
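The check described above can be sketched roughly as follows. This is a minimal illustration, not the script used to produce the count of 266: the function name `find_unreferenced_roots` is hypothetical, and rows are assumed to be `(token, root_id, tag_id, relation)` tuples in the same column order as the window above.

```python
def find_unreferenced_roots(rows):
    """Return tag_ids marked as roots (root_id == "0") that no token in the
    same context window points to through its own root_id.

    `rows` is assumed to hold (token, root_id, tag_id, relation) tuples,
    matching the column order printed in the example window.
    """
    # Annotations that claim to be the root of some relation.
    root_tags = {tag for _tok, root, tag, _rel in rows if root == "0"}
    # Annotations that some other token actually refers to.
    referenced = {root for _tok, root, _tag, _rel in rows
                  if root not in ("0", "-1")}
    return root_tags - referenced


# A shortened copy of the annotated tokens from the window above.
window = [
    ("there", "0", "T106", "0"),
    ("is", "0", "T106", "0"),
    ("called", "-1", "-1", "0"),
    ("this", "T106", "T194", "Refers-To"),
    ("the", "0", "T105", "0"),
    ("atom", "0", "T105", "0"),
]

print(find_unreferenced_roots(window))  # {'T105'}
```

T106 is not reported because the token "this" points to it, while nothing points to T105.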

sashaspala commented 4 years ago

Thanks for reporting - I'm looking into this now. It has to do with the fix we settled on for long-distance relationships (i.e. Secondary Def --> Definition --> Term), which was to mark only the final tag in the relationship as the root. The .deft files then contain relationships like this, where the Term is the root:

(Secondary Def, T1, T2, Supplements)
(Definition, T2, T3, Direct Defines)
(Term, T3, 0, 0)
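To make the chained representation concrete, here is a small sketch that follows root_id links until it reaches the root. The helper `resolve_chain` is hypothetical and not part of the corpus scripts; it assumes each annotation is a `(label, tag_id, root_id, relation)` tuple as in the example tuples above, with "0" meaning "no parent".

```python
def resolve_chain(annotations, start):
    """Follow root_id links from `start` until an annotation with root_id "0".

    `annotations` is assumed to be a list of (label, tag_id, root_id, relation)
    tuples. Returns the visited (tag_id, label, relation) steps in order.
    """
    by_id = {tag_id: (label, root_id, relation)
             for label, tag_id, root_id, relation in annotations}
    path, seen, current = [], set(), start
    while current != "0":
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        if current not in by_id:
            raise KeyError(f"{current} is referenced but never defined")
        seen.add(current)
        label, root_id, relation = by_id[current]
        path.append((current, label, relation))
        current = root_id
    return path


annotations = [
    ("Secondary Def", "T1", "T2", "Supplements"),
    ("Definition", "T2", "T3", "Direct Defines"),
    ("Term", "T3", "0", "0"),
]

for step in resolve_chain(annotations, "T1"):
    print(step)
```

Starting from T1, the walk passes through T2 and ends at T3, the Term, which is the only annotation whose root_id is "0".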

sashaspala commented 4 years ago

I take it back - on inspection this is actually a problem with overlapping relationships. In this case, there was a referential-definition (this) that "refers-to" the definition (there is a smallest unit that cannot be further subdivided) and also "indirect-defines" the term (the atom). Someone brought this up in the forums yesterday and we're aware of the problem. I'm working on finding a fix right now that handles this scenario without undermining our existing data format.
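One way to picture the overlap, as a rough sketch only: if each annotation can store a single (root_id, relation) pair, then an annotation with two outgoing relations loses one of them. The function names and the keep-the-first-relation rule below are assumptions for illustration, not the actual conversion logic; the tag IDs are taken from the example window in this thread.

```python
# Both relations that the referential definition "this" (T194) takes part in.
edges = [
    ("T194", "T106", "Refers-To"),         # this -> definition
    ("T194", "T105", "Indirect-Defines"),  # this -> term
]


def collapse_to_single_root(edges):
    """Keep only one (root_id, relation) pair per source tag_id.

    Assumption for illustration: the first relation seen wins.
    """
    rows = {}
    for source, target, relation in edges:
        rows.setdefault(source, (target, relation))
    return rows


def dropped_targets(edges):
    """Return targets that no longer have any incoming relation."""
    kept = {target for target, _rel in collapse_to_single_root(edges).values()}
    return {target for _src, target, _rel in edges} - kept


print(collapse_to_single_root(edges))  # {'T194': ('T106', 'Refers-To')}
print(dropped_targets(edges))          # {'T105'}
```

Under these assumptions T105 ends up with root_id "0" and nothing pointing to it, which matches the pattern reported at the top of the thread.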

davletov-aa commented 4 years ago

Hi, there are still problems with missing relations in the train and dev sets (I believe I have the current version of the data; please check):

{'data/source_txt/t3_physics_2_101.deft': {'T105', 'T109', 'T134', 'T145', 'T31'},
 'data/source_txt/t6_sociology_1_101.deft': {'T125', 'T142', 'T58'},
 'data/source_txt/t1_biology_1_505.deft': {'T189', 'T195', 'T241', 'T246', 'T282', 'T283', 'T72', 'T74', 'T86'},
 'data/source_txt/t2_history_0_0.deft': {'T151', 'T162', 'T47', 'T81', 'T95'},
 'data/source_txt/t6_sociology_0_101.deft': {'T76', 'T98'},
 'data/source_txt/t2_history_2_101.deft': {'T111', 'T131'},
 'data/source_txt/t7_government_1_101.deft': {'T103', 'T116'},
 'data/source_txt/t7_government_1_404.deft': {'T13'},
 'data/source_txt/t1_biology_0_303.deft': {'T129', 'T131', 'T176', 'T26', 'T296', 'T79', 'T82', 'T9', 'T94'},
 'data/source_txt/t1_biology_1_404.deft': {'T113', 'T173', 'T194', 'T195', 'T223', 'T231', 'T36', 'T7'},
 'data/source_txt/t5_economic_1_0.deft': {'T103', 'T140', 'T154', 'T50', 'T73', 'T89', 'T95'},
 'data/source_txt/t1_biology_2_404.deft': {'T113', 'T150', 'T167', 'T205', 'T228', 'T295', 'T299', 'T42'},
 'data/source_txt/t4_psychology_2_0.deft': {'T127', 'T204', 'T209', 'T232', 'T38'},
 'data/source_txt/t3_physics_0_101.deft': {'T157', 'T174', 'T39'},
 'data/source_txt/t7_government_0_303.deft': {'T20'},
 'data/source_txt/t5_economic_0_202.deft': {'T137'},
 'data/source_txt/t5_economic_1_202.deft': {'T47'},
 'data/source_txt/t4_psychology_0_303.deft': {'T17'},
 'data/source_txt/t7_government_1_0.deft': {'T16'},
 'data/source_txt/t1_biology_2_606.deft': {'T207', 'T259', 'T28', 'T37', 'T59', 'T83'},
 'data/source_txt/t4_psychology_1_0.deft': {'T123', 'T165', 'T200', 'T216', 'T221', 'T32'},
 'data/source_txt/t2_history_2_0.deft': {'T146', 'T151', 'T179', 'T25', 'T53', 'T76'},
 'data/source_txt/t7_government_1_303.deft': {'T13'},
 'data/source_txt/t1_biology_1_303.deft': {'T105', 'T15', 'T86'},
 'data/source_txt/t7_government_0_202.deft': {'T31', 'T35'},
 'data/source_txt/t1_biology_0_101.deft': {'T131', 'T261', 'T82'},
 'data/source_txt/t4_psychology_2_101.deft': {'T198', 'T31', 'T7'},
 'data/source_txt/t4_psychology_0_202.deft': {'T102', 'T21', 'T35', 'T36', 'T83'},
 'data/source_txt/t5_economic_0_101.deft': {'T1', 'T180', 'T7', 'T86'},
 'data/source_txt/t2_history_1_0.deft': {'T110', 'T158', 'T23', 'T51', 'T69', 'T7'},
 'data/source_txt/t1_biology_2_505.deft': {'T204', 'T229', 'T36'},
 'data/source_txt/t6_sociology_0_0.deft': {'T147', 'T40', 'T54', 'T82'},
 'data/source_txt/t1_biology_2_303.deft': {'T227', 'T36', 'T61'},
 'data/source_txt/t1_biology_1_0.deft': {'T143', 'T177', 'T238', 'T27', 'T47', 'T80'},
 'data/source_txt/t1_biology_0_0.deft': {'T103', 'T105', 'T109', 'T139', 'T151', 'T193', 'T211'},
 'data/source_txt/t7_government_1_202.deft': {'T88', 'T97'},
 'data/source_txt/t1_biology_2_101.deft': {'T127', 'T236', 'T243', 'T257', 'T261'},
 'data/source_txt/t2_history_0_101.deft': {'T9', 'T95'},
 'data/source_txt/t4_psychology_0_101.deft': {'T228', 'T248', 'T272', 'T28'},
 'data/source_txt/t3_physics_1_101.deft': {'T113', 'T143', 'T212', 'T31', 'T74', 'T98'},
 'data/source_txt/t3_physics_1_0.deft': {'T123', 'T126', 'T135', 'T152', 'T34', 'T43'},
 'data/source_txt/t1_biology_0_202.deft': {'T101', 'T120', 'T151', 'T159', 'T169', 'T281', 'T292', 'T298', 'T314', 'T51', 'T52', 'T56', 'T6', 'T64', 'T70', 'T85'},
 'data/source_txt/t5_economic_2_0.deft': {'T105', 'T168', 'T171', 'T63', 'T77', 'T89'},
 'data/source_txt/t7_government_2_0.deft': {'T20', 'T31', 'T36', 'T6'},
 'data/source_txt/t1_biology_1_606.deft': {'T127', 'T136', 'T18', 'T213', 'T230', 'T28', 'T89', 'T94', 'T99'},
 'data/source_txt/t4_psychology_2_202.deft': {'T38'},
 'data/source_txt/t7_government_2_202.deft': {'T31'},
 'data/source_txt/t5_economic_2_101.deft': {'T65'},
 'data/source_txt/t7_government_0_404.deft': {'T32', 'T36', 'T43'},
 'data/source_txt/t1_biology_1_101.deft': {'T100', 'T180', 'T188', 'T254', 'T54', 'T55'},
 'data/source_txt/t6_sociology_2_101.deft': {'T31'},
 'data/source_txt/t3_physics_2_0.deft': {'T135', 'T182', 'T19', 'T8', 'T96'},
 'data/source_txt/t2_history_1_101.deft': {'T72', 'T81'},
 'data/source_txt/t1_biology_0_606.deft': {'T253', 'T3', 'T85'},
 'data/source_txt/t1_biology_0_404.deft': {'T15', 'T159', 'T232', 'T246', 'T288', 'T346', 'T38', 'T62', 'T77', 'T9'},
 'data/source_txt/t5_economic_0_0.deft': {'T145'},
 'data/source_txt/t5_economic_2_202.deft': {'T140', 'T2', 'T93'},
 'data/source_txt/t4_psychology_0_0.deft': {'T212', 'T4', 'T72', 'T78', 'T82'},
 'data/source_txt/t1_biology_2_0.deft': {'T39', 'T59', 'T72', 'T98'},
 'data/source_txt/t4_psychology_1_101.deft': {'T157', 'T178', 'T179', 'T189', 'T210'},
 'data/source_txt/t1_biology_1_202.deft': {'T116', 'T16', 'T163', 'T172', 'T271', 'T30', 'T40', 'T57'},
 'data/source_txt/t4_psychology_1_202.deft': {'T113', 'T155', 'T28', 'T4', 'T44'},
 'data/source_txt/t7_government_0_101.deft': {'T72'},
 'data/source_txt/t1_biology_2_202.deft': {'T194', 'T203', 'T230', 'T263', 'T77'},
 'data/source_txt/t3_physics_0_0.deft': {'T29'},
 'data/source_txt/t7_government_2_101.deft': {'T31'},
 'data/source_txt/t7_government_2_303.deft': {'T7', 'T9'}}

davletov-aa commented 4 years ago

And here are the few examples that are still left:

{'data/source_txt/t1_biology_1_505.deft': {'T190', 'T195', 'T243', 'T246', 'T282', 'T283'},
 'data/source_txt/t1_biology_0_303.deft': {'T129', 'T131', 'T176', 'T296', 'T78', 'T94'},
 'data/source_txt/t1_biology_0_101.deft': {'T261'},
 'data/source_txt/t4_psychology_0_101.deft': {'T228', 'T248'},
 'data/source_txt/t5_economic_2_0.deft': {'T107', 'T78'}}