geli-gel closed 1 year ago
I know it happened, for example, with this PDF: 3cf45514384bbb7d083ae53e19bdc22300e648ab / https://s3.console.aws.amazon.com/s3/object/ai2-s2-pdfs?region=us-west-2&prefix=3cf4/5514384bbb7d083ae53e19bdc22300e648ab.pdf From reading the code, I believe it means we are trying to get tokens from a bib entry spangroup that has no tokens (an mmda-generated spangroup built from a box that the detector model gives us), but I didn't actually visualize the boxes we get for this PDF to confirm that suspicion. I'll see if I can easily do that somehow, or just check what the text is for the spangroups. Even just seeing the spans in anno store might be helpful enough.
Haven't merged yet because I wanted to prove my suspicion. I haven't done that yet because I'm unable to pull annos with spp_client for the example PDF, which uncovered another bug. The problem seems to be a mismatch between the IDs of bibentry boxes and bibentry spangroups, which does point to my original suspicion (a model box has no tokens, so there are no spans to generate a spangroup from), but since it causes problems with spp_client too, I want to investigate more before deciding what to do.
@geli-gel is there something I can help with here?
Sure thanks Kyle, here's what I have so far: I think that I recall seeing some code in mmda that makes text from spans unavailable to future spangroups in the same set of annos. I'm wondering if that's what's causing these spangroups with no spans -- for example we have 2 boxes drawn in the same area, however our box with id=0 ends up having nothing in anno store since it has no spans:
TODO:
Also, an upstream fix could be filtering the bib detection output to only the spangroups that contain spans.
@geli-gel
> I think that I recall seeing some code in mmda that makes text from spans unavailable to future spangroups in the same set of annos.
Can you point to what you're referring to? I don't quite understand this.
> I'm wondering if that's what's causing these spangroups with no spans -- for example we have 2 boxes drawn in the same area, however our box with id=0 ends up having nothing in anno store since it has no spans
Can you give me the PDF that's generating this?
@kyleclo sure, the paper above is sha 3cf45514384bbb7d083ae53e19bdc22300e648ab https://s3.console.aws.amazon.com/s3/object/ai2-s2-pdfs?prefix=3cf4%2F5514384bbb7d083ae53e19bdc22300e648ab.pdf&region=us-west-2# and this is the code I'm thinking of: https://github.com/allenai/mmda/blob/56b715da485ef7d577bdc322f05d09f143c27590/src/mmda/types/document.py#L182 But I also think the solution is to stop returning these overlapping boxes at their source, the model: filter the returned bib entry SpanGroups down to only the ones that contain spans (text).
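The filtering idea above could look roughly like the following. This is only a minimal sketch: `Span`, `SpanGroup`, and `keep_nonempty` are hypothetical stand-ins defined here for illustration, not the actual mmda types or any existing function in the model code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    # Hypothetical stand-in: character offsets into the document text.
    start: int
    end: int


@dataclass
class SpanGroup:
    # Hypothetical stand-in for a bib entry spangroup.
    id: int
    spans: List[Span] = field(default_factory=list)


def keep_nonempty(span_groups: List[SpanGroup]) -> List[SpanGroup]:
    """Return only the spangroups that contain at least one span."""
    return [sg for sg in span_groups if sg.spans]


bib_entries = [
    SpanGroup(id=0),  # e.g. an overlapping box that ended up with no spans
    SpanGroup(id=1, spans=[Span(start=10, end=42)]),
]
kept = keep_nonempty(bib_entries)
print([sg.id for sg in kept])
```

Note that the sketch keeps the original ids instead of renumbering. Whether that is the right choice depends on how box IDs and spangroup IDs are matched downstream, which is still an open question in this thread.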
Attempt at a fix for part 1 of https://github.com/allenai/scholar/issues/34858. I think we can work around the index error that keeps popping up this way.
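For illustration, a downstream workaround of this kind can be as simple as guarding the lookup instead of indexing blindly. The sketch below assumes the error comes from indexing into an empty token list, which is the suspicion raised earlier in the thread and has not been confirmed. The function name and the dict layout are hypothetical and are not taken from the actual change.

```python
from typing import Dict, List, Optional


def first_token_or_none(tokens: List[str]) -> Optional[str]:
    """Return the first token, or None when the list is empty."""
    # tokens[0] on an empty list raises IndexError, so check first.
    return tokens[0] if tokens else None


# Hypothetical data: bib entry id -> tokens found inside that entry.
tokens_by_entry: Dict[int, List[str]] = {
    0: [],                      # entry whose spangroup has no tokens
    1: ["[1]", "Smith", "J."],
}

first_tokens = {
    entry_id: first_token_or_none(tokens)
    for entry_id, tokens in tokens_by_entry.items()
}
print(first_tokens)
```

Callers then have to handle the `None` case explicitly (for example by skipping that entry), which avoids the crash but does not remove the empty spangroups themselves.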
tt verify integration test passes
next steps: