VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

Tesseract is run despite text layer being present #83

Closed by lvsass 2 months ago

lvsass commented 5 months ago

Self-explanatory title: I'm using Marker on a PDF that has an embedded text layer, but it still runs Tesseract. I don't think this behavior is expected. Is there a way to explicitly turn off OCR?

VikParuchuri commented 2 months ago

I'm fixing the OCR heuristics in the next version. I'll also add a way to disable OCR entirely.
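
For readers wondering what a text-layer heuristic of this kind might look like, here is a minimal sketch in plain Python. It is not Marker's actual code: `needs_ocr`, `pages_to_ocr`, the `ocr_enabled` switch, and the thresholds are all hypothetical names and values chosen for illustration. It assumes the embedded text of each page has already been extracted into a string by some PDF library.

```python
def needs_ocr(page_text: str, min_chars: int = 30, min_alnum_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: True when the embedded text layer looks missing or garbled."""
    # Ignore whitespace so that blank-looking pages count as empty.
    stripped = "".join(page_text.split())
    if len(stripped) < min_chars:
        # Too little embedded text: likely a scanned page with no usable text layer.
        return True
    # A low share of letters and digits suggests a broken encoding.
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


def pages_to_ocr(page_texts: list[str], ocr_enabled: bool = True) -> list[int]:
    """Return indices of pages that would be sent to OCR (assumed global off switch)."""
    if not ocr_enabled:
        # OCR disabled entirely: never OCR, whatever the text layer looks like.
        return []
    return [i for i, text in enumerate(page_texts) if needs_ocr(text)]
```

With a check like this, a page with a healthy text layer is skipped and only pages with little or garbled text are sent to OCR. Passing `ocr_enabled=False` models the "disable OCR entirely" option mentioned above.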