VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

IndexError: list index out of range (New) #279

Open wooemans opened 3 weeks ago

wooemans commented 3 weeks ago
xiaoran@xiaorandeMacBook-Pro ~ % marker_single "/Users/xiaoran/Downloads/Decree books/_I AM_ DECREE BOOKLET - BOOK 4.pdf" "/Users/xiaoran/Downloads" --batch_multiplier 2 --langs English
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Detecting bboxes: 100%|███████████████████████| 52/52 [1:20:39<00:00, 93.07s/it]
Loading recognition model vikp/surya_rec on device cpu with dtype torch.float32
Recognizing Text: 100%|███████████████████████████| 4/4 [04:48<00:00, 72.10s/it]
Detecting bboxes: 100%|██████████████████████| 35/35 [1:07:42<00:00, 116.07s/it]
Finding reading order: 100%|████████████████████| 35/35 [47:59<00:00, 82.28s/it]
Traceback (most recent call last):
  File "/usr/local/bin/marker_single", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/marker/convert.py", line 134, in convert_single_pdf
    extract_images(doc, pages)
  File "/usr/local/lib/python3.12/site-packages/marker/images/extract.py", line 72, in extract_images
    extract_page_images(page_obj, page)
  File "/usr/local/lib/python3.12/site-packages/marker/images/extract.py", line 42, in extract_page_images
    block = page.blocks[block_idx]
            ~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

Attachment 1 is the PDF file that produced the traceback above. I tried removing the blank pages from it, but the error persists. Page count does not seem to be the cause: some files with more pages convert successfully, while some files with fewer pages fail in the same way, such as the file in Attachment 2.

Could you please advise on how to resolve this issue? I have already installed the latest version of Marker, but the problem remains.

Attachment 1: I AM DECREE BOOKLET - BOOK 4.pdf
Attachment 2: I AM DECREES SERIES 2 (1).pdf
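For reference, the traceback ends at `page.blocks[block_idx]`, which suggests that an index computed for an image points at a block that is not in the page's block list. The snippet below is only a minimal sketch of that failure mode and of a bounds check around it. `Page` and `get_block` are hypothetical stand-ins invented for illustration; this is not marker's actual code, and it is not a confirmed fix.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical stand-in for a parsed page; not marker's real class.
    blocks: list = field(default_factory=list)


def get_block(page: Page, block_idx: int):
    """Return the block at block_idx, or None when the index is out of range."""
    if 0 <= block_idx < len(page.blocks):
        return page.blocks[block_idx]
    return None


page = Page(blocks=["text-0", "text-1"])

# Unguarded access raises the same error class as the traceback.
try:
    page.blocks[2]
except IndexError as exc:
    error_message = str(exc)

# Guarded access returns None instead of raising, so a caller could skip
# the image whose block index does not exist on the page.
found = get_block(page, 1)
missing = get_block(page, 2)
```

A guard like this would only hide the symptom. Why the index ends up past the end of the block list for these particular PDFs would still need to be checked against the attached files.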