VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
14.12k stars, 717 forks

Few Concerns on the Markdown Generation - Overlapping image/table/text boxes and Different output while using Surya #148

Open Curiosity007 opened 1 month ago

Curiosity007 commented 1 month ago
  1. Is it possible to avoid creating overlapping bboxes? That would make the elements much easier to identify. One example is shown below:

image

  2. This image came from the Surya Streamlit app. When I run the same PDF through marker, the extracted image is very different and is actually truncated. Are the marker and Surya repos in sync? (I think the Surya repo uses a more recent layout model than marker.)

  3. Is it possible to expose the bbox tolerance as a configurable argument, so that it can capture a little more of the surrounding area when image detection is off and only a cropped image is extracted?

  4. Is it possible to extract tables as images rather than printing them directly in the md file, at least as a configurable option? Table detection is not top-notch.

  5. How can I ensure that text inside a table or image is not repeated again in the md file?
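To make the bbox tolerance and repeated-text questions concrete, here is a minimal sketch in plain Python. It is not marker's or Surya's API: the helper names (`expand_bbox`, `covered_fraction`, `drop_text_inside`), the `(x0, y0, x1, y1)` box layout, and the `0.8` threshold are all assumptions for illustration only.

```python
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1) in page coordinates, origin at top-left.
BBox = Tuple[float, float, float, float]


def expand_bbox(bbox: BBox, pad: float, page_width: float, page_height: float) -> BBox:
    """Grow a box by `pad` on every side, clamped to the page bounds."""
    x0, y0, x1, y1 = bbox
    return (
        max(0.0, x0 - pad),
        max(0.0, y0 - pad),
        min(page_width, x1 + pad),
        min(page_height, y1 + pad),
    )


def area(bbox: BBox) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, bbox[2] - bbox[0]) * max(0.0, bbox[3] - bbox[1])


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two boxes (zero if they do not overlap)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def covered_fraction(inner: BBox, outer: BBox) -> float:
    """Fraction of `inner` that lies inside `outer`."""
    inner_area = area(inner)
    return intersection_area(inner, outer) / inner_area if inner_area > 0 else 0.0


def drop_text_inside(
    text_boxes: List[BBox], region_boxes: List[BBox], threshold: float = 0.8
) -> List[BBox]:
    """Keep only text boxes that are not mostly covered by a table/image region."""
    return [
        t
        for t in text_boxes
        if all(covered_fraction(t, r) < threshold for r in region_boxes)
    ]
```

With something like this, a user-supplied `pad` would widen image crops, and text blocks mostly covered by a table or image region would be skipped when writing the md file. The same `intersection_area` check could also flag overlapping layout boxes.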

homeant commented 1 month ago

layout_0

I had the same problem.