There was a bug in the predictor code that caused us to miss matching word_ids across batches.
When checking whether the previous word_id matched the current word_id, we'd miss the match if the previous word_id came from the previous batch and that batch had any Nones after it.
This PR moves the code around so that we don't loop through the zipped word_ids and label_ids until we've looked through the whole page's words and checked whether the word_id is found in the previous batch's list of word_ids.
I'm not sure if all the Nones need to be saved before zipping, but that's how I was able to get it to work!
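For reviewers, here's a minimal sketch of the cross-batch check described above. It is not the actual predictor code. It assumes word_ids are page-level indices that can repeat at the start of the next batch, and that special or padding tokens have a word_id of None. The function names and the toy batches are made up for illustration.

```python
from typing import List, Optional, Tuple

Batch = Tuple[List[Optional[int]], List[int]]  # (word_ids, label_ids), hypothetical shape


def first_token_labels_naive(batches: List[Batch]) -> List[Tuple[int, int]]:
    """Buggy variant: only remembers the single previous word_id, Nones included."""
    out = []
    prev_word_id = None
    for word_ids, label_ids in batches:
        for word_id, label_id in zip(word_ids, label_ids):
            if word_id is not None and word_id != prev_word_id:
                out.append((word_id, label_id))
            # A trailing None (padding/special token) overwrites the last real word_id,
            # so a word continuing into the next batch looks like a new word.
            prev_word_id = word_id
    return out


def first_token_labels(batches: List[Batch]) -> List[Tuple[int, int]]:
    """Fixed variant: also checks the previous batch's list of word_ids."""
    out = []
    prev_batch_word_ids: List[int] = []
    for word_ids, label_ids in batches:
        prev_word_id = None
        for word_id, label_id in zip(word_ids, label_ids):
            if word_id is None:
                continue  # skip special/padding tokens without losing prev_word_id
            continues_word = word_id == prev_word_id or word_id in prev_batch_word_ids
            if not continues_word:
                out.append((word_id, label_id))
            prev_word_id = word_id
        # Keep only the real word_ids of this batch for the next iteration.
        prev_batch_word_ids = [w for w in word_ids if w is not None]
    return out


# Toy data: word 1 is split across the two batches, and batch 1 ends with Nones.
batches = [
    ([None, 0, 1, 1, None, None], [0, 1, 1, 1, 0, 0]),
    ([None, 1, 2, None], [0, 2, 2, 0]),
]
print(first_token_labels_naive(batches))  # [(0, 1), (1, 1), (1, 2), (2, 2)]  word 1 twice
print(first_token_labels(batches))        # [(0, 1), (1, 1), (2, 2)]
```

In the naive version, word 1 is emitted twice, which is the kind of duplicate that could end up as overlapping spans downstream.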
Used the PDF that we error on because of overlapping spangroups, sha 5a1f34f57771e311e8e1a2bd953263b8183a3487 (see which spans overlapped).
With no changes, mention spangroups go from 0-191 (but they can't be annotated onto the doc because of the overlap).
With my beautiful changes they go from 0-189. I knew 1 was junk, so I ran the original code, filtered out the overlapping problem child spangroup, annotated the rest onto the doc, and printed out the spangroup texts. There were no other overlapping spangroups, so that was successful, and I found the reason for the other "missing" SpanGroup. It looks like this fix adds a further improvement.
The old code gets (id, text):
166 Gupta et al .
167 2020
and the new and improved code gets (id, text):
165 Gupta et al . 2020
yay!!!!!!! I fixed it!!
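If anyone wants to repeat the "filter out the overlapping spangroup, then annotate" check, here's a rough sketch of how the overlap can be detected. It uses plain (start, end) tuples and a dict of id to spans rather than the real SpanGroup objects, so all of the names and the toy data here are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # half-open character range [start, end), hypothetical shape


def spans_overlap(a: Span, b: Span) -> bool:
    """True if two half-open ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def overlapping_group_ids(groups: Dict[int, List[Span]]) -> Set[int]:
    """Return the ids of every group that overlaps some other group."""
    bad: Set[int] = set()
    for (id_a, spans_a), (id_b, spans_b) in combinations(groups.items(), 2):
        if any(spans_overlap(a, b) for a in spans_a for b in spans_b):
            bad.update((id_a, id_b))
    return bad


# Toy data: groups 0 and 1 overlap, group 2 is fine.
groups = {0: [(0, 5)], 1: [(3, 8)], 2: [(10, 12)]}
print(overlapping_group_ids(groups))  # {0, 1}

# Drop the group known to be junk (here id 1) and confirm nothing else overlaps.
kept = {gid: spans for gid, spans in groups.items() if gid != 1}
print(overlapping_group_ids(kept))  # set()
```

The pairwise comparison is quadratic in the number of groups, which is fine for a couple hundred spangroups on one document.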
Fix for https://github.com/allenai/scholar/issues/36714

PS:
old overlapping spangroups (id, len(spans), spans):
and new (id, len(spans), spans):
and text (id, text):