How to install ocrmypdf?

VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy

GNU General Public License v3.0

17.79k stars 1.02k forks

I'm running Windows 10. The directions currently given are not detailed enough. Please try to install ocrmypdf and use it with marker without getting errors. Where the hell do I find the root folder of marker if I installed it with the command below rather than through some kind of Visual Studio environment (#316)?

1. Installed marker:
    pip install marker-pdf
    https://github.com/VikParuchuri/marker/blob/master/docs/install_ocrmypdf.md
2. Installed ocrmypdf:
    winget install -e --id Python.Python.3.11
    winget install -e --id UB-Mannheim.TesseractOCR
    installed ghostscript
    python3 -m pip install ocrmypdf
3. set two variables for ocrmypdf to be used:
    set OCR_ALL_PAGES=true
    set OCR_ENGINE=ocrmypdf
4. trying to launch marker_single:
    marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
5. Getting errors.
    5.1 Trying to resolve: Introduced a new variable:
    set TESSDATA_PREFIX="C:\Program Files\Tesseract-OCR\tessdata"
6. Errors again, frustration starts and hopefully ends here (with your help).
    Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
    Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
    Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
    Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
    Loaded texify model to cpu with torch.float32 dtype
    Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
    Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.75s/it]
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "C:\Users\wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
      File "C:\Users\wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
        full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
        pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
        new_pages = tesseract_recognition(doc, ocr_idxs, langs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
        pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
        yield _result_or_cancel(fs.pop())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
        return fut.result(timeout)
               ^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
        return self.__get_result()
               ^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
        raise self._exception
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
        new_doc = pdfium.PdfDocument(f.name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
        self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
        raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
    pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
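One thing worth checking before rerunning: in cmd.exe, `set NAME="value"` stores the double quotes as part of the value, so the TESSDATA_PREFIX from step 5.1 may literally start with a quote character (the usual spelling is `set "NAME=value"`). The sketch below only inspects what a Python process started from the same console would see. `check_tessdata_prefix` is a hypothetical helper written for this thread, not part of marker or ocrmypdf, and it is an assumption that this is related to the error above.

```python
import os
from pathlib import Path


def check_tessdata_prefix(raw):
    """Report problems with a TESSDATA_PREFIX value (hypothetical helper, not part of marker)."""
    if raw is None or raw == "":
        return ["TESSDATA_PREFIX is not set"]
    problems = []
    # cmd.exe keeps the quotes when the variable is set as: set NAME="value"
    if raw.startswith('"') or raw.endswith('"'):
        problems.append("value contains literal double quotes")
    cleaned = raw.strip('"')
    if not Path(cleaned).is_dir():
        problems.append(f"directory does not exist: {cleaned}")
    return problems


if __name__ == "__main__":
    # Print the three variables mentioned in the steps above exactly as Python sees them.
    for name in ("OCR_ALL_PAGES", "OCR_ENGINE", "TESSDATA_PREFIX"):
        print(f"{name} = {os.environ.get(name)!r}")
    print(check_tessdata_prefix(os.environ.get("TESSDATA_PREFIX")))
```

An empty list means the value looks usable; it does not prove that marker reads the variable.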

In CMD I wrote (found it in #162):

pip show marker-pdf

Got this (bold is what I needed):

Name: marker-pdf
Version: 0.3.10
Summary: Convert PDF to markdown with high speed and accuracy.
Home-page: https://github.com/VikParuchuri/marker
Author: Vik Paruchuri
Author-email: github@vikas.sh
License: GPL-3.0-or-later
**Location: C:\Users\Wasup\miniconda3\Lib\site-packages**
Requires: filetype, ftfy, pdftext, Pillow, pydantic, pydantic-settings, python-dotenv, rapidfuzz, regex, surya-ocr, tabled-pdf, tabulate, texify, torch, tqdm, transformers
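The same location can also be found from Python, without reading the `pip show` output by eye. This is a small sketch using only the standard library; `package_dir` is a hypothetical helper name, and `marker` is assumed to be the import name because that is the folder that appears in the traceback paths.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the folder of an importable package, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Prints None if marker is not installed in the interpreter running this script.
    print(package_dir("marker"))
```

Run it with the same interpreter that `pip` installed into (here, the miniconda one), otherwise it may report a different site-packages folder or nothing at all.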

Went to C:\Users\Wasup\miniconda3\Lib\site-packages and found an undocumented settings.py file. Opened it with Notepad++, found the relevant lines, and filled in the TESSDATA_PREFIX value:

OCR_PARALLEL_WORKERS: int = 2 # How many CPU workers to use for OCR
TESSERACT_TIMEOUT: int = 20 # When to give up on OCR
TESSDATA_PREFIX: str = "C:\Program Files\Tesseract-OCR\tessdata"
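A note on that last line: in an ordinary Python string literal the backslashes are escape characters. `\P` and `\T` are not valid escapes, so Python keeps them and emits a warning, but `\t` is a tab, so `...\tessdata` silently turns into a tab followed by `essdata`. The sketch below shows the difference. The variable names are made up for illustration, and whether Tesseract accepts the forward-slash form on a given setup is an assumption to verify.

```python
# What the non-raw literal "C:\Program Files\Tesseract-OCR\tessdata" evaluates to.
# It is built explicitly here so that this sketch itself runs without warnings.
as_parsed = "C:\\Program Files\\Tesseract-OCR" + "\t" + "essdata"

# Two spellings that keep the path intact.
raw_literal = r"C:\Program Files\Tesseract-OCR\tessdata"    # raw string: backslashes stay literal
forward_slashes = "C:/Program Files/Tesseract-OCR/tessdata"  # usually accepted on Windows

print(repr(as_parsed))    # shows a tab escape where "\tessdata" used to be
print(repr(raw_literal))  # shows every backslash preserved
print("\t" in as_parsed, "\t" in raw_literal)
```

Editing a file inside site-packages is also lost on the next upgrade of marker-pdf, so a raw string there is at best a stopgap.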

Now at least marker recognizes that there is some kind of TESSDATA_PREFIX:

C:\Users\Wasup>marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
C:\Users\Wasup\miniconda3\Lib\site-packages\marker\settings.py:59: SyntaxWarning: invalid escape sequence '\P'
  **TESSDATA_PREFIX: str = "C\Program Files\Tesseract-OCR\tessdata"**
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.33s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
    new_pages = tesseract_recognition(doc, ocr_idxs, langs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
    pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
    new_doc = pdfium.PdfDocument(f.name)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).

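Both tracebacks end at `pdfium.PdfDocument(f.name)`, where `f.name` looks like a temporary file being reopened by name. Nothing in this thread confirms the cause, but one documented Windows behavior would fit: a file created with `tempfile.NamedTemporaryFile` generally cannot be opened a second time by name while the first handle is still open. The sketch below is not marker's code; `write_then_reopen` and the file name are hypothetical. It only illustrates a pattern that avoids the problem by closing the writer before anything reopens the path.

```python
import tempfile
from pathlib import Path


def write_then_reopen(data):
    """Write bytes to a temporary path, close it, then reopen it by name."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "ocr_output.pdf"  # hypothetical file name
        with open(tmp_path, "wb") as handle:
            handle.write(data)
        # The writer's handle is closed at this point, so a second open by name
        # does not compete with it, on Windows or elsewhere.
        with open(tmp_path, "rb") as reopened:
            return reopened.read()


if __name__ == "__main__":
    print(write_then_reopen(b"%PDF-1.7"))
```

If this is the cause, it would need a change in marker itself rather than in the local Tesseract or ocrmypdf setup.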

How to install ocrmypdf? #361