VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

OCR_ENGINE=None Doesn't work #256

Open svmrw opened 3 months ago

svmrw commented 3 months ago

Hello. The Readme says the following:

By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above). If you don't want OCR at all, set OCR_ENGINE to None.

export OCR_ENGINE=None
marker_single ./file.pdf ./marker

Running the command gives the following:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
OCR_ENGINE
  Input should be 'surya' or 'ocrmypdf' [type=literal_error, input_value='None', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/literal_error

I want to convert PDFs to markdown without using OCR. Almost all of my PDF files have text that can be selected and copied, and the embedded images need to be kept in their original form. It seems to me that a document does not need to be recognized as an image when its text can already be copied.

Is this possible? Was it supported before and then removed, or am I doing something wrong? Thanks.
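For anyone reading along: the validation error above suggests that the literal string 'None' coming from the environment is checked against the two allowed engine names and rejected. Below is a minimal stdlib-only sketch of how such a string could be mapped to "no OCR". The helper name parse_ocr_engine and the ALLOWED_ENGINES tuple are hypothetical and are not marker's actual code; the two engine names are taken from the error message, and the surya default from the README quote.

```python
import os
from typing import Optional

# Assumption: the allowed names are the two listed in the error message.
ALLOWED_ENGINES = ("surya", "ocrmypdf")


def parse_ocr_engine(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: turn an OCR_ENGINE string into an engine name or None."""
    if raw is None:
        # Variable not set: the README says surya is the default.
        return "surya"
    value = raw.strip()
    if value.lower() == "none":
        # Environment variables are always strings, so the text "None"
        # has to be translated into Python's None explicitly.
        return None
    if value not in ALLOWED_ENGINES:
        raise ValueError(
            f"OCR_ENGINE should be one of {ALLOWED_ENGINES} or 'None', got {value!r}"
        )
    return value


if __name__ == "__main__":
    engine = parse_ocr_engine(os.environ.get("OCR_ENGINE"))
    print("OCR disabled" if engine is None else f"OCR engine: {engine}")
```

This only covers parsing the setting. Whether the rest of the pipeline actually skips OCR when the value is None is a separate question that this sketch does not address.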

svmrw commented 3 months ago

#257

I tried applying the changes from your commit manually. The error is no longer displayed, but Surya OCR still loads and runs on the whole file. That is, OCR_ENGINE=None and OCR_ENGINE=surya behave the same, with no visible difference. I may well be doing something wrong, so could you please check this yourself?

kyr0 commented 2 months ago

I'm running into the same problem. Since OCR pushes my machine to its memory limit, I have to use different software now. Dead end.

svmrw commented 1 month ago

The problem still persists. The changes from here did not help either.

Personally, I don't care about performance. The issue is that OCR spoils embedded images, so I would like OCR_ENGINE=None to work.