VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

TypeError: Invalid input type 'PdfDocument' #137

Closed by Athe-kunal 1 month ago

Athe-kunal commented 1 month ago

I took the convert_single.py file and ran the following code on Google Colab after running pip install marker-pdf:

import os

from marker.convert import convert_single_pdf
from marker.models import load_all_models
from marker.output import save_markdown

# Load the models needed for conversion.
model_lst = load_all_models()
langs = ["en"]
batch_multiplier = 2
fname = "/content/input/temp.pdf"
output = "output_dir"
os.makedirs(output, exist_ok=True)

# Convert only the first page of the PDF.
full_text, images, out_meta = convert_single_pdf(
    str(fname), model_lst, max_pages=1, langs=langs, batch_multiplier=batch_multiplier
)

# Save the markdown, images, and metadata under the output directory.
fname = os.path.basename(fname)
subfolder_path = save_markdown(output, fname, full_text, images, out_meta)

I am getting the following error:

TypeError: Invalid input type 'PdfDocument'

The full traceback is:

Traceback (most recent call last):
  File "/usr/local/bin/marker_single", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
  File "/usr/local/lib/python3.10/dist-packages/marker/convert.py", line 66, in convert_single_pdf
    pages, toc = get_text_blocks(
  File "/usr/local/lib/python3.10/dist-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
  File "/usr/local/lib/python3.10/dist-packages/pdftext/extraction.py", line 65, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
  File "/usr/local/lib/python3.10/dist-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
  File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

I also tried the command line (marker_single) and got the same error. How can I resolve it?

VikParuchuri commented 1 month ago

Run pip install -U marker-pdf. That should fix it.

tangxqa commented 1 month ago

I encountered the same problem, even though I have already run pip install -U marker-pdf. The command I ran:

marker_single /Users/tangxqa/Downloads/Demystifying_the_Topologies_Behind_prompting_1706394504.pdf ./output --batch_multiplier 1 --max_pages 10 --langs Chinese

/opt/anaconda3/lib/python3.11/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

  warnings.warn(msg, RuntimeWarning)
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
  File "/opt/anaconda3/bin/marker_single", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/marker/convert.py", line 65, in convert_single_pdf
    pages, toc = get_text_blocks(
                 ^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pdftext/extraction.py", line 65, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

VikParuchuri commented 1 month ago

The code shown in your traceback is not from the latest version. Check whether you have 0.2.9 installed.
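
One way to check this from Python (for example in a Colab cell) is to ask the standard library's importlib.metadata for the installed versions. This is only a sketch: the helper name installed_versions is made up for illustration, and the distribution names pdftext and pypdfium2 are assumptions inferred from the site-packages paths in the traceback.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed distribution names: marker-pdf (from the pip command in this
    # thread), pdftext and pypdfium2 (from the traceback paths).
    packages = ["marker-pdf", "pdftext", "pypdfium2"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
```

If the reported marker-pdf version is older than expected after upgrading, the marker_single script on your PATH may belong to a different Python environment than the one pip upgraded.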

xzdong-2019 commented 1 month ago

I have marker-pdf 0.2.9, but I still see the same issue on Linux.
marker-pdf 0.2.8 works fine on Windows.

junnyyip188 commented 3 weeks ago

marker-pdf 0.2.9 through 0.2.13 all have the same issue on Ubuntu 22.04 (WSL2).