VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

Best pdf extractor I have seen, but still not accurate enough #170

Open Crestina2001 opened 4 months ago

Crestina2001 commented 4 months ago

Thanks for your great work! It still has some problems, though. I have a PDF that is not scanned (you can select the words in the file). When I run your tool on it, 'benefit' comes out as 'benets'. Strangely, Foxit PDF Editor makes the same mistake, while pymupdf extracts the word correctly. So the problem may come from one of the specific packages involved.

In addition, there are still some issues with tables: after running the pipeline, you still need to adjust the tables manually in the markdown to make sure they are correct. I don't have any ideas for how this could be improved. Just deciding where to put the bounding box for table extraction is intimidating for me.
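
One way to narrow down the 'benets' problem reported above is to diff two extractions of the same page word by word. The following is a minimal sketch, not part of marker: `word_mismatches` is a hypothetical helper, and it assumes you already have both texts as plain strings (for example marker's markdown output, and a reference produced with pymupdf's `page.get_text()`).

```python
import difflib


def word_mismatches(candidate_text, reference_text):
    """Return (candidate, reference) pairs for every run of words that differs.

    Hypothetical helper for illustration: candidate_text is the extraction
    under suspicion, reference_text is an extraction you trust more.
    """
    cand = candidate_text.split()
    ref = reference_text.split()
    # autojunk=False stops difflib from ignoring very frequent words
    # on long pages.
    matcher = difflib.SequenceMatcher(a=cand, b=ref, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((" ".join(cand[i1:i2]), " ".join(ref[j1:j2])))
    return mismatches


# Example with made-up sentences that mirror the reported error:
pairs = word_mismatches(
    "the main benets of the tool",
    "the main benefit of the tool",
)
print(pairs)
```

If the mismatches cluster on words containing "fi" or "fl", the embedded text layer (rather than the layout or table logic) is the likely culprit, though that is only a heuristic.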

VikParuchuri commented 3 months ago

You can set OCR_ALL_PAGES=true to force OCR. Some PDFs have already had OCR run on them and the resulting text added as a layer (which is why you can select it), and that text can be bad if the OCR engine was not good.
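
A minimal sketch of applying that setting from Python, assuming the tool reads `OCR_ALL_PAGES` from the process environment as the comment above suggests. The helper names `ocr_env` and `run_with_forced_ocr` are hypothetical, and the command you pass in (for instance the repository's own CLI with its input and output paths) is left to the caller because its exact arguments are not given in this thread.

```python
import os
import subprocess


def ocr_env(force_ocr=True, base_env=None):
    """Return a copy of the environment with OCR_ALL_PAGES set.

    base_env defaults to the current process environment; it is copied,
    never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCR_ALL_PAGES"] = "true" if force_ocr else "false"
    return env


def run_with_forced_ocr(command):
    """Run `command` (a list of strings) with OCR forced on every page.

    `command` is whatever conversion command you normally use; this
    sketch does not assume any particular flags.
    """
    return subprocess.run(command, env=ocr_env(True), check=True)


# Usage (not executed here; replace the list with your actual command):
# run_with_forced_ocr(["<your-convert-command>", "input.pdf", "output_dir"])
```

Setting the variable per subprocess rather than globally keeps other PDFs, whose embedded text is fine, on the faster non-OCR path.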