Open liunelson opened 3 weeks ago
How I'd go about debugging:
Some quick ideas for fixing once you narrow down which model is the problem...
Rerun the PDF with different padding values: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L76
I suspect that equations right at the boundary of the image may throw off the object detector.
Rerun the PDF with higher image resolutions: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L39
Oh... another likely possibility is that the object detector is picking those up as in-line equations. Set isolated_only = False to see if they are picked up. If this is the problem, keeping that setting will introduce a fair bit of noise into the process (higher recall, lower precision).
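The three ideas above (padding, resolution, isolated_only) can be tried together as a small grid. This is only a sketch: `sweep_settings`, `padding`, and `scale` are hypothetical names chosen for illustration, not the actual parameters in run.py; only isolated_only comes from this thread. Whatever calls the pipeline would have to map these values onto the real arguments.

```python
from itertools import product
from typing import Dict, Iterator, Sequence


def sweep_settings(
    paddings: Sequence[int],
    scales: Sequence[float],
    isolated_only_options: Sequence[bool] = (True, False),
) -> Iterator[Dict[str, object]]:
    """Yield one settings dict per combination of the three knobs.

    The keys "padding" and "scale" are placeholder names (assumptions);
    map them onto whatever the pipeline actually exposes.
    """
    for padding, scale, isolated_only in product(
        paddings, scales, isolated_only_options
    ):
        yield {
            "padding": padding,
            "scale": scale,
            "isolated_only": isolated_only,
        }


if __name__ == "__main__":
    # Print each combination; in practice, run the extraction once per
    # combination and record how many equations come back.
    for settings in sweep_settings(paddings=[0, 20, 40], scales=[1.0, 2.0]):
        print(settings)
```

Comparing the number of extracted equations per combination should show which knob, if any, recovers the missing ones.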
This shouldn't have been closed. This is a failure of the object detector, which means we should experiment with better ones or come up with a well-tuned preprocessing step (i.e., image tiling) that would allow us to capture these failure cases.
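For reference, a minimal sketch of what image tiling could look like, assuming the page has already been rendered to an image of known pixel size. The function names, the default tile size, and the overlap are all hypothetical and not taken from the repository. The overlap is there so that an equation cut by one tile boundary still falls entirely inside a neighbouring tile.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def _starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis so that the tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(
    width: int, height: int, tile: int = 1024, overlap: int = 256
) -> List[Box]:
    """Split a width x height page image into overlapping square tiles."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _starts(height, tile, stride)
        for x in _starts(width, tile, stride)
    ]


def to_page_coords(detection: Box, tile_box: Box) -> Box:
    """Shift a detection from tile-local to page coordinates."""
    dx, dy = tile_box[0], tile_box[1]
    left, top, right, bottom = detection
    return (left + dx, top + dy, right + dx, bottom + dy)
```

Each tile would be cropped, passed to the detector on its own, and the resulting boxes mapped back with `to_page_coords`. Detections that appear in two overlapping tiles would then need deduplicating (for example with non-maximum suppression), which this sketch leaves out.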
I extended the bug report with other failure examples (no extraction from figures, figure captions, tables, or inline text).
Expected behavior
This is a pretty clear-cut case of equation extraction that we should support.
Screenshots
Additional issues
The 4 equations in the graphical abstract don't show up either
Most if not all equations in the figures are not extracted
None of the equations in Table 1 is extracted
None of the inline equations is extracted, e.g.
An adjacent failure mode is that Greek letters representing parameters are all extracted as random floats: https://github.com/DARPA-ASKEM/terarium/issues/5061