VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

TypeError: Invalid input type 'PdfDocument' #235

Open Liu-XinYuan opened 4 months ago

Liu-XinYuan commented 4 months ago

I encountered the following error when running the following command:

(venv) (base) MacBook-Pro-2:contract-master dylan$ marker_single /Users/dylan/xxxx.pdf /Users/dylan --language Chinese
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
  File "/Users/dylan/ai/contract-master/venv/bin/marker_single", line 8, in <module>
    sys.exit(main())
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/convert.py", line 65, in convert_single_pdf
    pages, toc = get_text_blocks(
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 75, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

stupidcupid commented 3 months ago

same problem ;)

iksk commented 3 months ago

same problem ;)

MissTeven commented 3 months ago

same problem ;)

kenZhangCn commented 3 months ago

pip install pdftext==0.3.7, then pip install marker_pdf==0.2.6 (Mac, Intel); reference #183

mara004 commented 1 month ago

Looks like some caller tries to pass a PdfDocument instance as input to a new PdfDocument, which is nonsense. If you already have a document handle, use it.
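
To illustrate the point, here is a minimal sketch of the "reuse the handle" pattern. It deliberately does not import pypdfium2: `StandInDocument` and `ensure_document` are hypothetical names made up for this example, and the stand-in only mimics the type check seen in the traceback (it accepts paths only, which is a simplification).

```python
from pathlib import Path


class StandInDocument:
    """Hypothetical stand-in for a PDF document handle (not pypdfium2's class)."""

    def __init__(self, input_data):
        # Mimic the failure in the traceback: anything that is not a path is rejected.
        if not isinstance(input_data, (str, Path)):
            raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
        self.path = Path(input_data)


def ensure_document(source):
    """Return a document handle, reusing `source` if it already is one."""
    if isinstance(source, StandInDocument):
        return source  # already a handle: use it, do not wrap it again
    return StandInDocument(source)


if __name__ == "__main__":
    doc = ensure_document("example.pdf")
    # Passing the handle back in returns the same object instead of raising.
    print(ensure_document(doc) is doc)
```

A caller that may receive either a path or an open document can route both through one helper like this, so the constructor never sees a handle as its input.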

Update: see VikParuchuri's answer in https://github.com/VikParuchuri/pdftext/pull/10#issuecomment-2400925004: "I think the issues there were with mismatched pdftext/marker versions"
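
If mismatched versions are the cause, a quick check of what is actually installed can help before reinstalling. The sketch below uses only the standard library's `importlib.metadata`; the pins are the pair suggested earlier in this thread, and `PINS`, `installed_version` and `find_mismatches` are names invented for this example, not part of marker or pdftext.

```python
from importlib import metadata

# Pins taken from the workaround posted in this thread; adjust to your own known-good pair.
PINS = {"pdftext": "0.3.7", "marker_pdf": "0.2.6"}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Running it inside the same virtual environment as `marker_single` prints one line per package that deviates from the pinned pair, and nothing if both match.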

scottgigante-sightline commented 2 weeks ago

pip install pdftext==0.3.7, then pip install marker_pdf==0.2.6 (Mac, Intel); reference #183

This is the right answer :) For those of us still using Python 3.11, I'd love to see a 0.2.6.post1 which pins pdftext to 0.3.7 :)