VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
13.97k stars, 707 forks

Fix error with total_processes being cast to float #165

Closed · tosaddler closed 1 month ago

tosaddler commented 1 month ago

When total_processes is calculated, the value is cast to a float. torch.multiprocessing then fails, because mp.Pool requires an integer process count. Note the "7.0 processes" in the output below:

Converting 39 pdfs in chunk 3/4 with 7.0 processes, and storing in /data/ntp-technical-reports-md/lt_rpts/md
Traceback (most recent call last):
  File "/home/saddlerto/.asdf/installs/python/3.10.10/bin/marker", line 8, in <module>
    sys.exit(main())
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/site-packages/convert.py", line 123, in main
    with mp.Pool(processes=total_processes, initializer=worker_init, initargs=(model_lst,)) as pool:
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 321, in _repopulate_pool_static
    for i in range(processes - len(pool)):
TypeError: 'float' object cannot be interpreted as an integer
Loaded texify model to cuda with torch.float16 dtype
Converting 36 pdfs in chunk 4/4 with 7.0 processes, and storing in /data/ntp-technical-reports-md/lt_rpts/md
Traceback (most recent call last):
  File "/home/saddlerto/.asdf/installs/python/3.10.10/bin/marker", line 8, in <module>
    sys.exit(main())
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/site-packages/convert.py", line 123, in main
    with mp.Pool(processes=total_processes, initializer=worker_init, initargs=(model_lst,)) as pool:
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/saddlerto/.asdf/installs/python/3.10.10/lib/python3.10/multiprocessing/pool.py", line 321, in _repopulate_pool_static
    for i in range(processes - len(pool)):
TypeError: 'float' object cannot be interpreted as an integer
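
For readers following along, here is a minimal sketch of how a worker count can end up as a float and how an explicit `int()` cast avoids the `TypeError` above. The helper `compute_total_processes` and its parameters (`inference_ram`, `vram_per_task`) are hypothetical names chosen for illustration, not the actual code in `convert.py`, and the standard-library `multiprocessing` module stands in for `torch.multiprocessing`.

```python
import multiprocessing as mp


def compute_total_processes(num_files, workers, inference_ram, vram_per_task):
    """Hypothetical helper: cap the worker count by the available memory budget."""
    # Floor division returns a float when either operand is a float,
    # e.g. 24.5 // 3.5 == 7.0, so min() below can return 7.0 instead of 7.
    tasks_per_gpu = inference_ram // vram_per_task
    total = min(num_files, workers, tasks_per_gpu)
    # Cast to int before handing the value to Pool, and keep at least one worker.
    return max(1, int(total))


def square(x):
    return x * x


if __name__ == "__main__":
    # Assumed example values: 39 files, 8 requested workers, 24.5 GB budget, 3.5 GB per task.
    total_processes = compute_total_processes(39, 8, 24.5, 3.5)
    # Passing the uncast float (7.0) here would fail inside Pool with:
    # TypeError: 'float' object cannot be interpreted as an integer
    with mp.Pool(processes=total_processes) as pool:
        print(pool.map(square, range(4)))
```

Casting at the point where the value is computed, rather than at the `mp.Pool` call, also keeps the "Converting ... with N processes" log message showing a whole number.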

github-actions[bot] commented 1 month ago

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

tosaddler commented 1 month ago

I have read the CLA document and I hereby sign the CLA

VikParuchuri commented 1 month ago

Thanks for the fix, @tosaddler!