VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

[Bug] Incomplete PDF Conversion When Processing Multiple Files #159

Open pranshuchaurasia opened 1 month ago

pranshuchaurasia commented 1 month ago

Issue Summary: When attempting to batch convert multiple PDF files using the marker command, not all files in the specified directory are processed. Specifically, when the directory contains 20 PDF files, only 15 are converted, despite using appropriate flags to handle multiple files. I tried with different numbers of PDFs, and the result was the same.

Command Used: marker /path/to/input/folder /path/to/output/folder --workers 10

Expected Behavior: All PDF files in the input folder should be processed and converted to markdown when the --max flag is not specified.

Actual Behavior: Only 15 out of 20 PDF files are processed and converted. The remaining 5 files are only successfully converted when processed individually rather than as part of the batch.

Additional Information: (1) No error messages are output by marker when the issue occurs. (2) Individual processing of each of the 5 unconverted files succeeds with no issues. (3) This behavior is consistent across multiple attempts with different sets of PDF files.
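
Since marker prints no errors here, one way to pin down exactly which files were skipped is to compare the input folder against the output folder. The following is a minimal sketch, not part of marker: `find_unconverted` is a hypothetical helper, and it assumes each converted PDF leaves a file or subfolder in the output folder named after the PDF (without the `.pdf` extension). Adjust the matching rule if your marker version lays out its output differently.

```python
from pathlib import Path


def find_unconverted(input_dir, output_dir):
    """Return the PDFs in input_dir that have no matching entry in output_dir.

    Assumption (not confirmed by the marker docs): a converted PDF named
    "report.pdf" produces an output entry called "report" or "report.<ext>".
    """
    out_entries = list(Path(output_dir).iterdir())
    # Accept both the full entry name (e.g. a folder "report") and the
    # name without its extension (e.g. a file "report.md").
    produced = {p.name for p in out_entries} | {p.stem for p in out_entries}

    pdfs = sorted(
        p for p in Path(input_dir).iterdir() if p.suffix.lower() == ".pdf"
    )
    return [p for p in pdfs if p.stem not in produced]
```

Calling `find_unconverted("/path/to/input/folder", "/path/to/output/folder")` after a batch run would then list the PDFs that produced no output, which makes it easier to attach the exact failing files to a report.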

Kieran-who commented 1 month ago

Just wanted to also note that I faced issues with not all files converting. I had a folder with 2000+ files, and the output was missing roughly 10. I extracted those 10 missing files and ran the batch on just them, but got the same outcome: they still didn't convert. So there must be some issue with handling the files themselves. I can send those files if helpful.
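
For anyone wanting to repeat the isolation step described above, here is a rough sketch of copying a known list of skipped PDFs into a separate folder so the batch can be rerun on just those files. `isolate_pdfs` and the folder names are hypothetical and not part of marker; the list of skipped files has to come from your own comparison of input and output.

```python
import shutil
from pathlib import Path


def isolate_pdfs(pdf_paths, retry_dir):
    """Copy the given PDFs into retry_dir and return the new paths.

    The originals are left untouched; retry_dir is created if needed.
    """
    retry = Path(retry_dir)
    retry.mkdir(parents=True, exist_ok=True)

    copied = []
    for pdf in map(Path, pdf_paths):
        # copy2 also preserves timestamps, which helps when comparing runs.
        copied.append(Path(shutil.copy2(pdf, retry / pdf.name)))
    return copied
```

The retry folder can then be passed to marker as the input folder, in the same form as the command in the original report.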