VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
13.97k stars, 707 forks

TypeError in batch processing: #194

Closed by PeterAJansen 2 weeks ago

PeterAJansen commented 3 weeks ago

I'm seeing a type error with the following command from the README:

marker pdf/ marker/ --workers 5 --min_length 1000

Error:

Loaded detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded recognition model vikp/surya_rec on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Converting 69 pdfs in chunk 1/1 with 4.0 processes, and storing in /home/peter/github/discovery-knowledge-graph/papers/marker
Traceback (most recent call last):
  File "/home/peter/anaconda3/envs/dkg/bin/marker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/peter/anaconda3/envs/dkg/lib/python3.11/site-packages/convert.py", line 123, in main
    with mp.Pool(processes=total_processes, initializer=worker_init, initargs=(model_lst,)) as pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peter/anaconda3/envs/dkg/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peter/anaconda3/envs/dkg/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/peter/anaconda3/envs/dkg/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/peter/anaconda3/envs/dkg/lib/python3.11/multiprocessing/pool.py", line 321, in _repopulate_pool_static
    for i in range(processes - len(pool)):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'float' object cannot be interpreted as an integer

PeterAJansen commented 3 weeks ago

It looks like installing with the recommended pip install marker-pdf causes this, but pulling the current repo and installing via pip install -e . doesn't, so is the version available on pip likely older?
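
The log line "with 4.0 processes" and the final traceback frame, range(processes - len(pool)), suggest the pool size reached multiprocessing as a float. The following is a minimal sketch of that failure and of one possible workaround (casting to int). The value 4.0 is taken from the log, but how marker computes it is not shown in this thread, so total_processes here is only a stand-in and try_range is a hypothetical helper.

```python
# Stand-in for the value passed to mp.Pool in the traceback (assumption:
# the log's "4.0 processes" is the same number that reached the pool).
total_processes = 4.0


def try_range(n):
    """Mimic the failing stdlib line: range(processes - len(pool))."""
    pool = []  # the pool is empty when it is first populated
    try:
        return list(range(n - len(pool)))
    except TypeError as exc:
        return str(exc)


float_result = try_range(total_processes)      # float pool size fails
int_result = try_range(int(total_processes))   # int pool size works

print(float_result)  # 'float' object cannot be interpreted as an integer
print(int_result)    # [0, 1, 2, 3]
```

If the installed convert.py is editable locally, wrapping the pool size in int() before the mp.Pool call is one plausible stopgap; whether that matches the change in the repository is not confirmed here.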

VikParuchuri commented 2 weeks ago

Yes, will release soon to fix this
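
Until a new release is published, it can help to confirm which build of the package is actually installed in the environment. A small sketch using the standard library's importlib.metadata follows; installed_version is a hypothetical helper name, and the thread does not say which version number contains the fix, so this only reports what is installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "marker-pdf" is the PyPI distribution name mentioned in this thread.
print(installed_version("marker-pdf"))
```

Comparing this value before and after pip install -e . in a clone of the repository shows whether the environment picked up the newer code.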