allenai / mmda

multimodal document analysis
Apache License 2.0

Bib predictor index error bug fix #186

Closed geli-gel closed 1 year ago

geli-gel commented 1 year ago

Attempt at a fix for part 1 of https://github.com/allenai/scholar/issues/34858. I think we can work around the index error that keeps popping up this way.

tt verify integration test passes

next steps:

geli-gel commented 1 year ago

I know it happened, for example, with this PDF: 3cf45514384bbb7d083ae53e19bdc22300e648ab / https://s3.console.aws.amazon.com/s3/object/ai2-s2-pdfs?region=us-west-2&prefix=3cf4/5514384bbb7d083ae53e19bdc22300e648ab.pdf From reading the code, I believe it means we are trying to get tokens from a bib entry spangroup that has no tokens (an mmda-generated spangroup built from a box that the detector model gives us), but I didn't actually visualize the boxes we get for this PDF to confirm that suspicion. I'll see if I can easily do that somehow, or just check what the text is for the spangroups; even just seeing the spans in the anno store might be helpful enough.
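To make the suspected failure concrete, here is a minimal sketch of the kind of guard that would avoid the index error. `BibEntry` and `first_token_or_none` are hypothetical stand-ins written for illustration; they are not the mmda classes or the predictor's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BibEntry:
    """Hypothetical stand-in for a bib entry spangroup (not the mmda class)."""
    id: int
    tokens: List[str] = field(default_factory=list)


def first_token_or_none(entry: BibEntry) -> Optional[str]:
    # entry.tokens[0] raises IndexError when the list is empty,
    # so check for emptiness before indexing.
    if not entry.tokens:
        return None
    return entry.tokens[0]


entries = [BibEntry(id=0), BibEntry(id=1, tokens=["[1]", "Smith", "2020"])]
firsts = [first_token_or_none(e) for e in entries]
```

A guard like this only works around the symptom; it does not explain why an entry ends up without tokens in the first place.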

geli-gel commented 1 year ago

I haven't merged yet because I wanted to prove my suspicion. I haven't done that yet because I'm unable to pull annos with spp_client for the example PDF, which uncovered another bug. The problem seems to be that there is a mismatch between the IDs of bibentry boxes and bibentry spangroups, which does point to my original suspicion (a model box with no tokens, and therefore no spans to generate a spangroup from), but since it causes problems with spp_client too, I want to investigate more before deciding what to do.

kyleclo commented 1 year ago

@geli-gel is there something I can help with here?

geli-gel commented 1 year ago

Sure thanks Kyle, here's what I have so far: I think that I recall seeing some code in mmda that makes text from spans unavailable to future spangroups in the same set of annos. I'm wondering if that's what's causing these spangroups with no spans -- for example we have 2 boxes drawn in the same area, however our box with id=0 ends up having nothing in anno store since it has no spans:

[three attached images]

geli-gel commented 1 year ago

TODO:

also - upstream fix could be filtering bib detection output to only spangroups that contain spans
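
A rough sketch of what that upstream filter could look like, assuming each detected entry carries a list of spans. `DetectedEntry` and `keep_entries_with_spans` are hypothetical names for illustration, not the mmda `SpanGroup` API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DetectedEntry:
    """Hypothetical stand-in for a detected bib entry (not the mmda SpanGroup)."""
    id: int
    # (start, end) character offsets; empty when the box covered no tokens.
    spans: List[Tuple[int, int]] = field(default_factory=list)


def keep_entries_with_spans(entries: List[DetectedEntry]) -> List[DetectedEntry]:
    # Drop entries whose span list is empty so downstream code never indexes into them.
    return [e for e in entries if e.spans]


detected = [DetectedEntry(0), DetectedEntry(1, [(0, 42)])]
kept = keep_entries_with_spans(detected)
kept_ids = [e.id for e in kept]
```

Note that the kept entries retain their original ids, so the ids are no longer contiguous after filtering; whether they should be renumbered to match the boxes is a separate decision.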

kyleclo commented 1 year ago

@geli-gel

> I think that I recall seeing some code in mmda that makes text from spans unavailable to future spangroups in the same set of annos.

can u point to what you're referring to? i don't quite understand this

> I'm wondering if that's what's causing these spangroups with no spans -- for example we have 2 boxes drawn in the same area, however our box with id=0 ends up having nothing in anno store since it has no spans

can u give me the PDF that's generating this?

geli-gel commented 1 year ago

@kyleclo sure, the paper above is sha 3cf45514384bbb7d083ae53e19bdc22300e648ab https://s3.console.aws.amazon.com/s3/object/ai2-s2-pdfs?prefix=3cf4%2F5514384bbb7d083ae53e19bdc22300e648ab.pdf&region=us-west-2# and this is the code I'm thinking of: https://github.com/allenai/mmda/blob/56b715da485ef7d577bdc322f05d09f143c27590/src/mmda/types/document.py#L182 But also, I think the solution is to stop returning these overlapping boxes at their source, the model: filter the returned bib entry SpanGroups down to only the ones that contain spans (text).
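
To illustrate the suspicion, here is a toy sketch of how two boxes drawn over the same area could leave one of them with no tokens if each token can only be claimed once. This is not the linked mmda code; `Box`, `assign_tokens`, and the order in which the boxes are processed are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Token = Tuple[str, float, float]  # (text, center_x, center_y)


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box, given by its corner coordinates."""
    id: int
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def assign_tokens(boxes: List[Box], tokens: List[Token]) -> Dict[int, List[str]]:
    """Give each token to the first box (in list order) that contains its center."""
    available = list(tokens)
    assigned: Dict[int, List[str]] = {}
    for box in boxes:
        inside = [t for t in available if box.contains(t[1], t[2])]
        assigned[box.id] = [t[0] for t in inside]
        # Tokens claimed by this box are removed from the pool,
        # so later boxes over the same area find nothing left.
        available = [t for t in available if t not in inside]
    return assigned


tokens = [("Smith", 1.0, 1.0), ("2020", 2.0, 1.0)]
# Two boxes over the same area; box 1 is assumed to be processed first.
boxes = [Box(1, 0, 0, 3, 2), Box(0, 0, 0, 3, 2)]
result = assign_tokens(boxes, tokens)
```

Under these assumptions, the box with id=0 ends up with an empty token list, which is the kind of entry the proposed filter would drop.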