[BUG]: Failure modes of "Extraction from documents" (Obvious equations not extracted from PDF)

DARPA-ASKEM / terarium

https://app.terarium.ai

Apache License 2.0


[BUG]: Failure modes of "Extraction from documents" (Obvious equations not extracted from PDF) #4988

Open liunelson opened 3 weeks ago

liunelson commented 3 weeks ago

Download this PDF https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9547654/pdf/main.pdf
Upload it to Terarium
Drag it onto the workflow canvas
Notice that Equation (3) is not showing up in the list

Expected behavior: This is a pretty clear-cut case of equation extraction that we should support.

Screenshots: [Screenshot 2024-09-30 at 2.45.36 PM]

Additional issues

The 4 equations in the graphical abstract don't show up either
Most if not all equations in the figures are not extracted
None of the equations in Table 1 is extracted
None of the inline equations is extracted, e.g.
An adjacent failure mode is that Greek letters representing parameters are all extracted as random floats: https://github.com/DARPA-ASKEM/terarium/issues/5061

j2whiting commented 3 weeks ago

How I'd go about debugging:

Figure out which model is the problem: the object detector or the image-to-text transformer.
Inspect the images that the object detector outputs. If it's missing these equations, the detector is the problem. https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L29
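
One way to make that check systematic, as a rough sketch: hand-annotate where the equations are on a page, then list the ones that no detector box overlaps. Everything here is an assumption for illustration (the box format, the helper names `iou` and `missed_by_detector`, and the 0.5 threshold); none of it comes from run.py.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in page pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def missed_by_detector(
    expected: List[Box], detected: List[Box], threshold: float = 0.5
) -> List[int]:
    """Indices of hand-annotated boxes that no detected box matches."""
    return [
        i
        for i, e in enumerate(expected)
        if all(iou(e, d) < threshold for d in detected)
    ]
```

If an equation's index comes back from `missed_by_detector`, the detector never produced a crop for it. If it does not, the crop exists and the image-to-text step is the next place to look.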

Some quick ideas for fixing once you narrow down which model is the problem...

Rerun the PDF with different padding values: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L76

I suspect that equations right at the boundary of the image may throw off the object detector.

Rerun the PDF with higher image resolutions: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L39
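
A minimal sketch of how a padding and resolution sweep could be planned. It assumes padding means a uniform pixel border added around the rendered page and that pages are rasterized from PDF points (1/72 inch); I have not confirmed that run.py works this way, and `rendered_size` and `sweep` are hypothetical helpers.

```python
from itertools import product
from typing import Iterable, Iterator, Tuple

POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def rendered_size(
    width_pt: float, height_pt: float, dpi: int, padding_px: int = 0
) -> Tuple[int, int]:
    """Pixel size of a page rasterized at `dpi`, plus a border on every side."""
    scale = dpi / POINTS_PER_INCH
    width_px = round(width_pt * scale) + 2 * padding_px
    height_px = round(height_pt * scale) + 2 * padding_px
    return width_px, height_px


def sweep(dpis: Iterable[int], paddings: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Every (dpi, padding) combination to rerun the PDF with."""
    yield from product(dpis, paddings)


# Example: list the image sizes for a US Letter page (612 x 792 pt).
for dpi, pad in sweep([150, 300], [0, 32]):
    print(dpi, pad, rendered_size(612, 792, dpi, pad))
```

Printing the sizes up front is a cheap way to see how quickly the images grow before committing to a rerun at each setting.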

j2whiting commented 3 weeks ago

Oh... another likely possibility is that the object detector is classifying those as inline equations. Set isolated_only = False to see if they are picked up. If this is the problem, turning the filter off will add a fair bit of noise to the process (higher recall, lower precision).

https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L76
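
To put numbers on that trade-off, the two settings can be scored against a hand-labeled page. This is a generic sketch of the standard definitions, not code from the repository; `precision_recall` is a hypothetical helper and the counts have to come from manual labeling.

```python
from typing import Tuple


def precision_recall(
    true_positives: int, false_positives: int, false_negatives: int
) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

Scoring the same page once with isolated_only = True and once with isolated_only = False would show whether the extra recall is worth the extra false positives.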

j2whiting commented 3 weeks ago

This shouldn't have been closed. This is a failure of the object detector, which means we should experiment with better detectors or come up with a well-tuned preprocessing step (i.e., image tiling) that lets us capture these failure cases.

liunelson commented 3 weeks ago

I extended the bug report with other failure examples (no extraction from figures, figure captions, tables, and inline text).
