Open dfanr opened 3 weeks ago
Mac only supports torch 2.2.2 right now, which makes it impossible to install the latest package.
Did you happen to find the documentation for pdftext? I am not able to find any. If you did, please share it.
pdftext-0.3.7
def get_text_blocks(doc, fname, max_pages: Optional[int] = None, start_page: Optional[int] = None) -> (List[Page], Dict):
    toc = get_toc(doc)
    if start_page:
        assert start_page < len(doc)
    else:
        start_page = 0
    if max_pages:
        if max_pages + start_page > len(doc):
            max_pages = len(doc) - start_page
    else:
        max_pages = len(doc) - start_page
    page_range = range(start_page, start_page + max_pages)
    char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=True, workers=settings.PDFTEXT_CPU_WORKERS)
    marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]
    return marker_blocks, toc
The repository has since changed: the function now takes the filename instead of the document object.
I am running marker-pdf on a Mac server with torch=2.2.2, pdftext=0.3.10, marker-pdf=0.2.6, and python=3.11, and I ran into this problem too. I rebuilt the environment several times, but the problem persisted.
marker_single /Users/xuefeng/Downloads/pdf/173000004046314212.pdf /Users/xuefeng/Downloads/pdf/ --batch_multiplier 1
/opt/anaconda3/envs/marker/lib/python3.11/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
warnings.warn(msg, RuntimeWarning)
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
File "/opt/anaconda3/envs/marker/bin/marker_single", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/convert_single.py", line 26, in main
full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/marker/convert.py", line 65, in convert_single_pdf
pages, toc = get_text_blocks(
^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pdftext/extraction.py", line 75, in dictionary_output
pages = _get_pages(pdf_path, model, page_range, workers=workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pdftext/extraction.py", line 26, in _get_pages
pdf_doc = pdfium.PdfDocument(pdf_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/marker/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'
I got the same issue on my Mac. I resolved it by creating a new venv, installing only PyTorch and marker, and then running the marker command with just the input file path and the output path.
My Mac has an Intel chip, and torch is only supported up to version 2.2.2 there. That version works with marker up to 0.2.6 but not with the latest version: marker-pdf 0.2.14 depends on torch >= 2.2.2 and < 3.0.0, while surya-ocr 0.4.12 depends on torch >= 2.3.0 and < 3.0.0. So I need to switch to a different environment for deployment on my side.
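The conflict described above can be checked by hand. This is a minimal sketch that uses only the bounds quoted in this comment and a simplified tuple comparison. The helper names `parse` and `satisfies` are made up for illustration, and real resolvers such as pip apply fuller version rules.

```python
def parse(version: str) -> tuple:
    """Turn '2.2.2' into (2, 2, 2). Simplified: numeric release segments only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


# torch bounds as quoted in the comment above.
requirements = {
    "marker-pdf 0.2.14": ("2.2.2", "3.0.0"),
    "surya-ocr 0.4.12": ("2.3.0", "3.0.0"),
}

torch_version = "2.2.2"  # newest torch on an Intel Mac, per this thread

results = {
    name: satisfies(torch_version, lower, upper)
    for name, (lower, upper) in requirements.items()
}
for name, ok in results.items():
    print(f"{name}: torch {torch_version} {'OK' if ok else 'not allowed'}")
```

torch 2.2.2 falls inside the first range but below the lower bound of the second, which is why the two packages cannot be installed together on that machine.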
The dictionary_output() function from pdftext only accepts a pdf_path parameter, but here a doc object was passed in.
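One way to guard against this mismatch in calling code is to normalize the argument to a path before it reaches pdftext. This is a minimal sketch: `ensure_pdf_path` is a hypothetical helper, not part of marker or pdftext, and the commented call at the bottom assumes the `dictionary_output(pdf_path, ...)` form visible in the traceback above.

```python
import os
from pathlib import Path


def ensure_pdf_path(source) -> str:
    """Return a filesystem path, or raise if given an already-opened document."""
    if isinstance(source, (str, os.PathLike)):
        return os.fspath(source)
    raise TypeError(
        f"expected a path to a PDF, got {type(source).__name__}; "
        "pass the filename rather than the opened document object"
    )


print(ensure_pdf_path(Path("paper.pdf")))

# With pdftext installed, the call would then look roughly like this
# (argument names taken from the traceback, not verified against every release):
# char_blocks = dictionary_output(ensure_pdf_path(fname), page_range=page_range, keep_chars=True)
```

Failing early with a message that names the expected type makes the version mismatch easier to spot than the `Invalid input type 'PdfDocument'` error raised deep inside pypdfium2.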