Closed nguyenvulong closed 1 week ago
Hi there, thank you for reporting.
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Please try troubleshooting by reducing the number of cores available to pdf2pdfocr. Use the "-j" flag with a float number.
-j PARALLEL_PERCENT run this percentual jobs in parallel (0 - 1.0] - multiply with the number of CPU cores (default = 1 [all cores])
@nguyenvulong can you please test "-j" flag?
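For reference, here is a minimal sketch of how a "-j" fraction could translate into a worker count. It is not the actual pdf2pdfocr code: the function name `cpus_to_use` and the round-down-with-a-minimum-of-one rule are assumptions. The logs in this thread only confirm that the default uses all 40 CPUs and that "-j0.1" gives 4.

```python
import os
from typing import Optional


def cpus_to_use(parallel_percent: float, total_cpus: Optional[int] = None) -> int:
    """Map a fraction in (0, 1.0] to a number of worker processes.

    Hypothetical helper: the rounding rule is an assumption, not taken
    from the pdf2pdfocr source.
    """
    if not 0 < parallel_percent <= 1.0:
        raise ValueError("parallel_percent must be in the range (0, 1.0]")
    if total_cpus is None:
        # os.cpu_count() can return None, so fall back to a single CPU.
        total_cpus = os.cpu_count() or 1
    # Round down, but always keep at least one worker.
    return max(1, int(total_cpus * parallel_percent))


if __name__ == "__main__":
    print(cpus_to_use(0.1, total_cpus=40))
    print(cpus_to_use(1.0, total_cpus=40))
```

With 40 cores, a value such as "-j0.1" keeps the pool small, which is useful when the environment limits how many processes or threads can be created.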
Hello, I tested your suggestion, but the error still persists:
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i indy.pdf
-------------------------------------
File: indy.pdf
[2024-07-12 01:34:21.316715] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:34:21.326141] [DEBUG] Tesseract version: 4
[2024-07-12 01:34:21.350514] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:34:21.358522] [DEBUG] Qpdf version: 10.6.3
[2024-07-12 01:34:21.358743] [DEBUG] Temp dir is /tmp/pdf2pdfocr_2W40R/
[2024-07-12 01:34:21.358780] [DEBUG] Prefix is 2W40R
[2024-07-12 01:34:21.358826] [DEBUG] Script dir is /usr/local/bin/
[2024-07-12 01:34:21.358910] [DEBUG] Parallel operations will use 4 CPUs
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
self.main_pool = multiprocessing.Pool(self.cpu_to_use)
File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
self._worker_handler.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Also, I found that if I use a relative path, the file is not found:
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i ../input_pdf/indy.pdf
-------------------------------------
File: ../input_pdf/indy.pdf
[2024-07-12 01:35:46.805611] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:35:46.814532] [DEBUG] Tesseract version: 4
[2024-07-12 01:35:46.834198] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:35:46.839840] [DEBUG] Qpdf version: 10.6.3
Error: ../input_pdf/indy.pdf not found. Exiting.
Thank you @nguyenvulong
I searched for the bug and found this: https://forums.docker.com/t/runtimeerror-cant-start-new-thread/138142/3
But the "--privileged" flag with "docker run" is not recommended due to security issues.
Please try this: https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
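To help narrow this down, a small diagnostic sketch like the one below could be run inside the container. It only checks whether Python can start a single thread and prints a couple of limit files if they exist. The helper names are hypothetical, and the cgroup path assumes cgroup v2, so it may be absent on your system.

```python
import threading
from typing import Optional


def can_start_thread() -> bool:
    """Return True if this process is able to start one more thread."""
    worker = threading.Thread(target=lambda: None)
    try:
        worker.start()
    except RuntimeError:
        # Same error as in the traceback: "can't start new thread".
        return False
    worker.join()
    return True


def read_limit(path: str) -> Optional[str]:
    """Return the stripped contents of a limit file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("can start thread:", can_start_thread())
    # System-wide thread limit on Linux.
    print("threads-max:", read_limit("/proc/sys/kernel/threads-max"))
    # Per-cgroup task limit (assumes cgroup v2; path may not exist).
    print("pids.max:", read_limit("/sys/fs/cgroup/pids.max"))
```

If `can_start_thread()` returns False even though the limits look generous, the restriction is more likely coming from the container runtime than from the number of cores requested.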
Thank you for your time. The previous issue disappeared when using the "--privileged" flag, but now it gets stuck at writing the output file:
[2024-07-17 05:15:33.797851] [LOG] Converting input file to images...
[2024-07-17 05:15:35.053642] [LOG] Checking blank pages
[2024-07-17 05:15:35.554620] [LOG] Starting OCR with tesseract...
[2024-07-17 05:15:40.063411] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2024-07-17 05:15:43.068226] [LOG] OCR completed
[2024-07-17 05:15:43.069049] [DEBUG] We have 1 ocr'ed files
[2024-07-17 05:15:43.076980] [DEBUG] Joined ocr'ed PDF files
[2024-07-17 05:15:43.077054] [DEBUG] Merging with OCR
[2024-07-17 05:15:43.134226] [DEBUG] Autorotate skipped
[2024-07-17 05:15:43.134368] [DEBUG] Editing producer
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 733, in ocr
self.edit_producer()
File "/usr/local/bin/pdf2pdfocr.py", line 1370, in edit_producer
with open(self.output_file, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/home/docker/indy-OCR.pdf'
Actually, I am more curious whether the problem I had when running the toy example is specific to my case (a limited number of allowed threads on my machine) or a common issue that others here have also encountered. I also mentioned the relative path problem in my previous comment. You may want to check it out, just in case.
Hi @nguyenvulong
It looks like write permission is missing on your host working directory (please note the use of "$(pwd)" on the command line). Please see https://docs.docker.com/storage/bind-mounts/#choose-the--v-or---mount-flag
In your test case, the working directory $(pwd) is mapped to "/home/docker". You must have write permission there to generate the output file.
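A quick way to confirm this is to check, from inside the container, whether the mounted directory is writable by the current user. The sketch below is only an illustration: the helper name `output_dir_is_writable` is made up, and "/home/docker" is the mount target from the command in this thread.

```python
import os
import tempfile


def output_dir_is_writable(directory: str) -> bool:
    """Return True if files can be created in the given directory.

    Creating a file needs both write and search (execute) permission
    on the directory itself.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Inside the container you would pass "/home/docker" instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(scratch, "writable:", output_dir_is_writable(scratch))
    print("/home/docker writable:", output_dir_is_writable("/home/docker"))
```

If this reports False for "/home/docker", the PermissionError above is expected, since the output file is written next to the input file.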
I don't know if the thread issue is a common problem. You are the first to report. :(
About the relative paths, it looks like the "-v" flag of Docker doesn't allow ".." to navigate through folders, but relative paths starting in the current folder "." should work.
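The sketch below illustrates why a path containing ".." cannot be found. It assumes the container resolves input paths relative to "/home/docker", which is not confirmed in this thread, and the helper name `is_inside_mount` is hypothetical. Only the mounted directory and its children exist in the container, so anything that resolves above it is not visible.

```python
import posixpath


def is_inside_mount(mount_point: str, relative_path: str) -> bool:
    """Return True if relative_path, resolved from mount_point, stays inside it."""
    root = posixpath.normpath(mount_point)
    # normpath collapses "." and ".." without touching the filesystem.
    resolved = posixpath.normpath(posixpath.join(root, relative_path))
    return resolved == root or resolved.startswith(root.rstrip("/") + "/")


if __name__ == "__main__":
    # Assumed working directory inside the container.
    mount = "/home/docker"
    for candidate in ("indy.pdf", "./sub/indy.pdf", "../input_pdf/indy.pdf"):
        print(candidate, "->", is_inside_mount(mount, candidate))
```

Under that assumption, "../input_pdf/indy.pdf" resolves to "/home/input_pdf/indy.pdf", which lies outside the bind mount. Running the command from a directory that contains the input file, or mounting a parent directory, avoids the problem.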
Thank you Leo, I will keep an eye on this. Will reopen the issue if needed. Good day!
I did a quick test and got the error below.
System information
Error log