DARPA-ASKEM / terarium

https://app.terarium.ai
Apache License 2.0

[BUG]: Failure modes of "Extraction from documents" (Obvious equations not extracted from PDF) #4988

Open liunelson opened 3 weeks ago

liunelson commented 3 weeks ago
  1. Download this PDF https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9547654/pdf/main.pdf
  2. Upload it to Terarium
  3. Drag it onto the workflow canvas
  4. Notice that Equation (3) does not show up in the list

Expected behavior: This is a pretty clear-cut case of equation extraction that we should support.

Additional issues

  1. The 4 equations in the graphical abstract don't show up either

  2. Most if not all equations in the figures are not extracted

  3. None of the equations in Table 1 is extracted

  4. None of the inline equations is extracted

  5. An adjacent failure mode is that Greek letters representing parameters are all extracted as random floats: https://github.com/DARPA-ASKEM/terarium/issues/5061

j2whiting commented 3 weeks ago

How I'd go about debugging:

Some quick ideas for fixing once you narrow down which model is the problem...

Rerun the PDF with different padding values: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L76

I suspect that equations right at the boundary of the image may throw off the object detector.


Rerun the PDF with higher image resolutions: https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L39
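
A minimal sketch of how the two reruns above could be combined into one sweep, so the detection counts can be compared side by side. The `run_extraction` callable and its `padding` and `dpi` keyword names are hypothetical; the real parameter names in `run.py` may differ. `fake_extraction` is only a stand-in so the sketch runs on its own and does not reflect real detector behavior.

```python
from itertools import product


def sweep_settings(run_extraction, paddings, dpis):
    """Run the extractor once per (padding, dpi) pair and count detections."""
    counts = {}
    for padding, dpi in product(paddings, dpis):
        equations = run_extraction(padding=padding, dpi=dpi)
        counts[(padding, dpi)] = len(equations)
    return counts


def fake_extraction(padding, dpi):
    """Stand-in for the real pipeline (hypothetical, made-up behavior)."""
    found = ["eq1", "eq2"]
    if padding >= 20:
        found.append("eq3")  # pretend extra margin rescues a boundary equation
    if dpi >= 300:
        found.append("eq4")  # pretend higher resolution rescues a small one
    return found


counts = sweep_settings(fake_extraction, paddings=[0, 20], dpis=[150, 300])
best = max(counts, key=counts.get)  # setting with the most detections
```

Swapping `fake_extraction` for a wrapper around the real pipeline would show whether Equation (3) appears at any padding or resolution, which helps separate a boundary problem from a resolution problem.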


j2whiting commented 3 weeks ago

Oh... another likely possibility is that the object detector is picking those up as in-line equations. Set isolated_only = False to see if they are picked up. If this is the problem, turning the filter off will introduce a fair bit of noise into the process (higher recall, lower precision).

https://github.com/DARPA-ASKEM/document_intelligence/blob/aebfa641b03dc39a61d3b957bb77970b536d5470/document_intelligence/fast_latex/run.py#L76
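
To illustrate the trade-off, here is a small sketch of what an `isolated_only` filter might look like. The `Detection` class, the `"isolated"` / `"embedded"` label names, and the `min_score` threshold are assumptions made for illustration; they are not taken from `run.py`.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # assumed labels: "isolated" (display) or "embedded" (in-line)
    score: float  # detector confidence in [0, 1]
    box: tuple    # (left, top, right, bottom) in pixels


def select(detections, isolated_only=True, min_score=0.5):
    """Keep confident detections, optionally dropping in-line ones."""
    kept = []
    for det in detections:
        if det.score < min_score:
            continue
        if isolated_only and det.label != "isolated":
            continue
        kept.append(det)
    return kept


# Made-up detections for one page.
page = [
    Detection("isolated", 0.92, (100, 200, 500, 260)),
    Detection("embedded", 0.81, (120, 400, 480, 450)),  # display equation mislabeled as in-line
    Detection("embedded", 0.66, (300, 520, 340, 540)),  # a genuine in-line symbol
    Detection("isolated", 0.30, (50, 700, 90, 720)),    # low-confidence noise
]

strict = select(page, isolated_only=True)
loose = select(page, isolated_only=False)
```

With the filter on, only the first detection survives; with it off, the mislabeled display equation is recovered, but so is the in-line symbol, which is the extra noise described above.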

j2whiting commented 3 weeks ago

This shouldn't have been closed. This is a failure of the object detector, which means we should experiment with better ones or come up with a well-tuned preprocessing step (i.e., image tiling) that allows us to capture these failure cases.
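
A rough sketch of the tiling idea, assuming the page has already been rendered to an image of known pixel size. The function names, the tile size, and the overlap are illustrative choices, not values from the existing pipeline. Overlapping tiles mean an equation cut by one tile boundary should sit fully inside a neighboring tile, as long as it is smaller than the overlap.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the whole image."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    stride = tile - overlap

    def starts(length):
        # A single tile is enough when the image fits inside it.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_page_coords(box, tile_box):
    """Shift a detection box from tile coordinates back to page coordinates."""
    dx, dy = tile_box[0], tile_box[1]
    left, top, right, bottom = box
    return (left + dx, top + dy, right + dx, bottom + dy)


boxes = tile_boxes(2000, 1000, tile=1024, overlap=128)
shifted = to_page_coords((10, 20, 110, 60), boxes[1])
```

Each tile would be passed to the detector separately, with the resulting boxes shifted back via `to_page_coords` and duplicates from overlapping regions merged afterwards (for example with non-maximum suppression).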

liunelson commented 3 weeks ago

I extended the bug report with other failure examples (no extraction from figures, figure captions, tables, and inline text).