allenai / mmda

multimodal document analysis
Apache License 2.0

Fix for mention predictor span_group overlap #252

Closed geli-gel closed 1 year ago

geli-gel commented 1 year ago

Fix for https://github.com/allenai/scholar/issues/36714

There was a bug in the predictor code that caused us to miss matching word_ids across batches.

When checking whether the current word_id matched the previous one, we'd miss the match if the previous word_id came from the previous batch and that batch had any Nones after it.

This PR moves the code around so that we don't loop through the zipped word_ids and label_ids until we've gone through the whole page's words and checked whether each word_id is found in the previous batch's list of word_ids.

I'm not sure whether all the Nones need to be saved before zipping, but that's how I was able to get it to work!
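To make the failure mode concrete, here is a minimal sketch of the idea. It is not the code from this PR: `first_label_per_word`, the `(word_ids, label_ids)` batch layout, and the `-100` filler label are assumptions made up for illustration. Instead of comparing against only the single previous word_id (which is None when the previous batch ends in padding), it checks membership in the previous batch's full set of word_ids.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical batch layout: parallel lists of word_ids and label_ids.
# word_ids contain None for special/padding tokens.
Batch = Tuple[Sequence[Optional[int]], Sequence[int]]


def first_label_per_word(batches: Sequence[Batch]) -> List[Tuple[int, int]]:
    """Keep one (word_id, label_id) pair per word, even across batch boundaries."""
    results: List[Tuple[int, int]] = []
    prev_batch_word_ids = set()  # non-None word_ids seen in the previous batch
    for word_ids, label_ids in batches:
        seen_in_batch = set()
        for word_id, label_id in zip(word_ids, label_ids):
            if word_id is None:
                continue  # special or padding token, no word attached
            if word_id in prev_batch_word_ids or word_id in seen_in_batch:
                continue  # word already handled, here or in the previous batch
            seen_in_batch.add(word_id)
            results.append((word_id, label_id))
        # Remember every real word_id in this batch, ignoring trailing Nones.
        prev_batch_word_ids = {w for w in word_ids if w is not None}
    return results


# Word 2 ends the first batch (followed by Nones) and starts the second one.
example_batches = [
    ([None, 0, 1, 1, 2, None, None], [-100, 0, 1, 1, 2, -100, -100]),
    ([None, 2, 3, None], [-100, 2, 0, -100]),
]
print(first_label_per_word(example_batches))
```

A check that only looks at the immediately preceding word_id would compare word 2 in the second batch against a trailing None and emit it twice, which is the kind of duplication that can end up as overlapping spans.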

Tested with a PDF that we error on because of overlapping SpanGroups, sha 5a1f34f57771e311e8e1a2bd953263b8183a3487 (see which spans overlapped).

With no changes, mention SpanGroups go from 0-191 (but they can't be annotated onto the doc because of the overlap).

With my beautiful changes they go from 0-189. I knew 1 was junk, so I ran the original code and printed out what the SpanGroup texts would be without the overlapping problem-child SpanGroup, by filtering it out and then annotating onto the doc. There were no other overlapping SpanGroups, so that was successful. I also found the reason for the other "missing" SpanGroup, and it looks like this fix brings a further improvement. The old code gets (id, text):

166 Gupta et al .
167 2020

and the new and improved code gets (id, text):

165 Gupta et al . 2020

yay!!!!!!! I fixed it!!
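For reference, the "filter out the overlapping SpanGroup, then annotate" check could look roughly like the sketch below. This is an assumption about one way to do it, not what I actually ran: `drop_overlapping` is a hypothetical helper, groups are plain lists of `{'start', 'end'}` dicts rather than mmda objects, and preferring the group with more spans is just a heuristic for throwing away the junk one.

```python
from typing import Dict, List

Span = Dict[str, int]  # only 'start' and 'end' are used here


def drop_overlapping(groups: List[List[Span]]) -> List[int]:
    """Return ids of groups to keep so that no two kept groups share a character."""
    # Visit groups with more spans first (stable sort keeps the original order on ties).
    order = sorted(range(len(groups)), key=lambda i: -len(groups[i]))
    taken = set()  # character offsets already claimed by a kept group
    kept = []
    for i in order:
        chars = {c for span in groups[i] for c in range(span["start"], span["end"])}
        if chars & taken:
            continue  # overlaps something we already kept, so drop it
        taken |= chars
        kept.append(i)
    return sorted(kept)


# Trimmed-down version of the old output: groups 1 and 2 both contain 3055-3064.
old_groups = [
    [{"start": 3049, "end": 3053}],
    [{"start": 3055, "end": 3064}],
    [{"start": 3055, "end": 3064}, {"start": 3064, "end": 3065}],
    [{"start": 3207, "end": 3216}],
]
print(drop_overlapping(old_groups))
```

On the trimmed example this keeps groups 0, 2 and 3 and drops the single-span group 1.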

PS: old overlapping spangroups (id, len(spans), spans):

0 9 [{'start': 3014, 'end': 3020, 'box': {'left': 0.3214123112304209, 'top': 0.6161514386418996, 'width': 0.0396044117350607, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3020, 'end': 3021, 'box': {'left': 0.36101672296548165, 'top': 0.6161514386418996, 'width': 0.004284452926037671, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3021, 'end': 3027, 'box': {'left': 0.3653011758915193, 'top': 0.6161514386418996, 'width': 0.04070230279735788, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3027, 'end': 3028, 'box': {'left': 0.40600347868887715, 'top': 0.6161514386418996, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3029, 'end': 3035, 'box': {'left': 0.41780580760857156, 'top': 0.6161514386418996, 'width': 0.04487964440024464, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3035, 'end': 3036, 'box': {'left': 0.46268545200881617, 'top': 0.6161514386418996, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3037, 'end': 3038, 'box': {'left': 0.47447171423003787, 'top': 0.6139121415736171, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3039, 'end': 3047, 'box': {'left': 0.0631428110657914, 'top': 0.6292950078467207, 'width': 0.052792493398020425, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3049, 'end': 3053, 'box': {'left': 0.12371292541610952, 'top': 0.6292950078467207, 'width': 0.030151837466990093, 'height': 0.01004169088915966, 'page': 1}}]
1 1 [{'start': 3055, 'end': 3064, 'box': {'left': 0.16161694489614903, 'top': 0.6292950078467207, 'width': 0.04794570602544031, 'height': 0.01004169088915966, 'page': 1}}]
2 9 [{'start': 3055, 'end': 3064, 'box': {'left': 0.16161694489614903, 'top': 0.6292950078467207, 'width': 0.04794570602544031, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3064, 'end': 3065, 'box': {'left': 0.20956265092158935, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3066, 'end': 3075, 'box': {'left': 0.21723583833381493, 'top': 0.6292950078467207, 'width': 0.06012961903385994, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3075, 'end': 3076, 'box': {'left': 0.2773654573676749, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3077, 'end': 3081, 'box': {'left': 0.2851404005368939, 'top': 0.6292950078467207, 'width': 0.026630552718402912, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3081, 'end': 3082, 'box': {'left': 0.3117709532552968, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3083, 'end': 3084, 'box': {'left': 0.31952045748526736, 'top': 0.6270557107784381, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3085, 'end': 3094, 'box': {'left': 0.3339014915098207, 'top': 0.6292950078467207, 'width': 0.061883566950456596, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3096, 'end': 3100, 'box': {'left': 0.40351983488331467, 'top': 0.6292950078467207, 'width': 0.030151837466990093, 'height': 0.01004169088915966, 'page': 1}}]
3 11 [{'start': 3207, 'end': 3216, 'box': {'left': 0.2122698896142294, 'top': 0.655651433923498, 'width': 0.06012961903385991, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3216, 'end': 3217, 'box': {'left': 0.2723995086480893, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3218, 'end': 3224, 'box': {'left': 0.282188144692546, 'top': 0.655651433923498, 'width': 0.04225541698304652, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3224, 'end': 3225, 'box': {'left': 0.3244435616755925, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3226, 'end': 3235, 'box': {'left': 0.3341866754077101, 'top': 0.655651433923498, 'width': 0.04794570602544031, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3235, 'end': 3236, 'box': {'left': 0.38213238143315037, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3237, 'end': 3241, 'box': {'left': 0.39190093410451626, 'top': 0.655651433923498, 'width': 0.030607060590381607, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3241, 'end': 3242, 'box': {'left': 0.4225079946948979, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3243, 'end': 3244, 'box': {'left': 0.43228190293242136, 'top': 0.6534121368552155, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3245, 'end': 3250, 'box': {'left': 0.4487569633245756, 'top': 0.655651433923498, 'width': 0.032468119830129226, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3252, 'end': 3256, 'box': {'left': 0.0631428110657914, 'top': 0.6688652949645433, 'width': 0.030151837466990107, 'height': 0.01004169088915966, 'page': 1}}]

and new (id, len(spans), spans):

0 9 [{'start': 3014, 'end': 3020, 'box': {'left': 0.3214123112304209, 'top': 0.6161514386418996, 'width': 0.0396044117350607, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3020, 'end': 3021, 'box': {'left': 0.36101672296548165, 'top': 0.6161514386418996, 'width': 0.004284452926037671, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3021, 'end': 3027, 'box': {'left': 0.3653011758915193, 'top': 0.6161514386418996, 'width': 0.04070230279735788, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3027, 'end': 3028, 'box': {'left': 0.40600347868887715, 'top': 0.6161514386418996, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3029, 'end': 3035, 'box': {'left': 0.41780580760857156, 'top': 0.6161514386418996, 'width': 0.04487964440024464, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3035, 'end': 3036, 'box': {'left': 0.46268545200881617, 'top': 0.6161514386418996, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3037, 'end': 3038, 'box': {'left': 0.47447171423003787, 'top': 0.6139121415736171, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3039, 'end': 3047, 'box': {'left': 0.0631428110657914, 'top': 0.6292950078467207, 'width': 0.052792493398020425, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3049, 'end': 3053, 'box': {'left': 0.12371292541610952, 'top': 0.6292950078467207, 'width': 0.030151837466990093, 'height': 0.01004169088915966, 'page': 1}}]
1 9 [{'start': 3055, 'end': 3064, 'box': {'left': 0.16161694489614903, 'top': 0.6292950078467207, 'width': 0.04794570602544031, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3064, 'end': 3065, 'box': {'left': 0.20956265092158935, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3066, 'end': 3075, 'box': {'left': 0.21723583833381493, 'top': 0.6292950078467207, 'width': 0.06012961903385994, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3075, 'end': 3076, 'box': {'left': 0.2773654573676749, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3077, 'end': 3081, 'box': {'left': 0.2851404005368939, 'top': 0.6292950078467207, 'width': 0.026630552718402912, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3081, 'end': 3082, 'box': {'left': 0.3117709532552968, 'top': 0.6292950078467207, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3083, 'end': 3084, 'box': {'left': 0.31952045748526736, 'top': 0.6270557107784381, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3085, 'end': 3094, 'box': {'left': 0.3339014915098207, 'top': 0.6292950078467207, 'width': 0.061883566950456596, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3096, 'end': 3100, 'box': {'left': 0.40351983488331467, 'top': 0.6292950078467207, 'width': 0.030151837466990093, 'height': 0.01004169088915966, 'page': 1}}]
2 11 [{'start': 3207, 'end': 3216, 'box': {'left': 0.2122698896142294, 'top': 0.655651433923498, 'width': 0.06012961903385991, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3216, 'end': 3217, 'box': {'left': 0.2723995086480893, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3218, 'end': 3224, 'box': {'left': 0.282188144692546, 'top': 0.655651433923498, 'width': 0.04225541698304652, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3224, 'end': 3225, 'box': {'left': 0.3244435616755925, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3226, 'end': 3235, 'box': {'left': 0.3341866754077101, 'top': 0.655651433923498, 'width': 0.04794570602544031, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3235, 'end': 3236, 'box': {'left': 0.38213238143315037, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3237, 'end': 3241, 'box': {'left': 0.39190093410451626, 'top': 0.655651433923498, 'width': 0.030607060590381607, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3241, 'end': 3242, 'box': {'left': 0.4225079946948979, 'top': 0.655651433923498, 'width': 0.0037488963102829623, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3243, 'end': 3244, 'box': {'left': 0.43228190293242136, 'top': 0.6534121368552155, 'width': 0.010416576176429108, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3245, 'end': 3250, 'box': {'left': 0.4487569633245756, 'top': 0.655651433923498, 'width': 0.032468119830129226, 'height': 0.01004169088915966, 'page': 1}}, {'start': 3252, 'end': 3256, 'box': {'left': 0.0631428110657914, 'top': 0.6688652949645433, 'width': 0.030151837466990107, 'height': 0.01004169088915966, 'page': 1}}]

and text (id, text):

0 Nabity - Grover , Cheung , & Thatcher 2020
1 Stieglitz , Mirbabaie , Ross , & Neuberger 2018
2 Mirbabaie , Bunker , Stieglitz , Marx , & Ehnis 2020
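The overlap in the old output is easy to see by eye (old groups 1 and 2 both start with the span 3055-3064), but a small check like the following sketch can confirm that the new output has none. The helper names are hypothetical and the spans are reduced to their `start`/`end` fields, with only a few spans per group copied from the dumps above.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Span = Dict[str, int]  # only 'start' and 'end' matter for this check


def spans_overlap(a: Span, b: Span) -> bool:
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def overlapping_group_pairs(groups: List[List[Span]]) -> List[Tuple[int, int]]:
    """Return (i, j) for every pair of groups that share at least one character."""
    pairs = []
    for (i, g1), (j, g2) in combinations(enumerate(groups), 2):
        if any(spans_overlap(a, b) for a in g1 for b in g2):
            pairs.append((i, j))
    return pairs


# A few spans from each group in the old and new dumps.
old = [
    [{"start": 3014, "end": 3020}, {"start": 3049, "end": 3053}],
    [{"start": 3055, "end": 3064}],
    [{"start": 3055, "end": 3064}, {"start": 3096, "end": 3100}],
    [{"start": 3207, "end": 3216}, {"start": 3252, "end": 3256}],
]
new = [
    [{"start": 3014, "end": 3020}, {"start": 3049, "end": 3053}],
    [{"start": 3055, "end": 3064}, {"start": 3096, "end": 3100}],
    [{"start": 3207, "end": 3216}, {"start": 3252, "end": 3256}],
]
print(overlapping_group_pairs(old))
print(overlapping_group_pairs(new))
```

The pairwise comparison is quadratic in the number of groups, which should be fine for a single page; a sort-and-sweep over the spans would be the thing to try if it ever got slow.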