LeoFCardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!
Apache License 2.0

RuntimeError: can't start new thread #49

Closed: nguyenvulong closed this issue 1 week ago

nguyenvulong commented 3 weeks ago

I did a quick test and got the error below.

System information

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Linux dev4-1 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:33 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:33 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 nvidia:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Error log

❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./inby.pdf
Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
37aaf24cf781: Pull complete 
da892f4d0cb0: Pull complete 
df89c9ce1e48: Pull complete 
d2a3165daa7e: Pull complete 
663286a455ab: Pull complete 
4f4fb700ef54: Pull complete 
35693ee7cdbf: Pull complete 
4215239b5448: Pull complete 
Digest: sha256:6f446c6fa612ffd304bede285556cc0190f53c6506f8a7200a69a603261643a6
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
-------------------------------------
File: ./inby.pdf
[2024-07-10 01:00:35.107971] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-10 01:00:35.117933] [DEBUG] Tesseract version: 4
[2024-07-10 01:00:35.144010] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-10 01:00:35.151576] [DEBUG] Qpdf version: 10.6.3
[2024-07-10 01:00:35.151798] [DEBUG] Temp dir is /tmp/pdf2pdfocr_F7DGC/
[2024-07-10 01:00:35.151836] [DEBUG] Prefix is F7DGC
[2024-07-10 01:00:35.151884] [DEBUG] Script dir is /usr/local/bin/
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
    pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
  File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
    self.main_pool = multiprocessing.Pool(self.cpu_to_use)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
    self._worker_handler.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

LeoFCardoso commented 3 weeks ago

Hi there, thank you for reporting.

[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs

Please try troubleshooting by reducing the number of cores available to pdf2pdfocr. Use the "-j" flag with a float number.

-j PARALLEL_PERCENT run this percentual jobs in parallel (0 - 1.0] - multiply with the number of CPU cores (default = 1 [all cores])
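
As a rough illustration of the help text above, the sketch below scales a core count by the "-j" fraction. The function name `workers_for` and the rounding rule (truncate, but never go below one worker) are assumptions made for illustration and are not taken from pdf2pdfocr's source. The two debug logs in this thread (40 CPUs by default, 4 CPUs with -j0.1) are consistent with it.

```python
import os


def workers_for(cpu_count: int, parallel_percent: float) -> int:
    """Scale a core count by a fraction in the range (0, 1.0]."""
    if not 0 < parallel_percent <= 1.0:
        raise ValueError("parallel_percent must be in (0, 1.0]")
    # Assumed rounding rule: truncate, but always keep at least one worker.
    return max(1, int(cpu_count * parallel_percent))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(f"{cores} cores, -j 0.1 -> {workers_for(cores, 0.1)} workers")
```

Because the value is a fraction of the detected cores, the same "-j" setting yields a different number of workers on different machines.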

LeoFCardoso commented 3 weeks ago

@nguyenvulong can you please test the "-j" flag?

nguyenvulong commented 3 weeks ago

Hello, I tested your suggestion, but the error still persists:

❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i  indy.pdf
-------------------------------------
File: indy.pdf
[2024-07-12 01:34:21.316715] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:34:21.326141] [DEBUG] Tesseract version: 4
[2024-07-12 01:34:21.350514] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:34:21.358522] [DEBUG] Qpdf version: 10.6.3
[2024-07-12 01:34:21.358743] [DEBUG] Temp dir is /tmp/pdf2pdfocr_2W40R/
[2024-07-12 01:34:21.358780] [DEBUG] Prefix is 2W40R
[2024-07-12 01:34:21.358826] [DEBUG] Script dir is /usr/local/bin/
[2024-07-12 01:34:21.358910] [DEBUG] Parallel operations will use 4 CPUs
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
    pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
  File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
    self.main_pool = multiprocessing.Pool(self.cpu_to_use)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
    self._worker_handler.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Also, I found that if I use a relative path, the file is not found:

❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i  ../input_pdf/indy.pdf

-------------------------------------
File: ../input_pdf/indy.pdf
[2024-07-12 01:35:46.805611] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:35:46.814532] [DEBUG] Tesseract version: 4
[2024-07-12 01:35:46.834198] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:35:46.839840] [DEBUG] Qpdf version: 10.6.3
Error: ../input_pdf/indy.pdf not found. Exiting.

LeoFCardoso commented 3 weeks ago

Thank you @nguyenvulong

I searched for the bug and found this: https://forums.docker.com/t/runtimeerror-cant-start-new-thread/138142/3

But the "--privileged" flag with "docker run" is not recommended due to security issues.

Please try checking the thread limits described here instead: https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
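
To make the linked discussion concrete, here is a minimal diagnostic sketch that prints a few Linux limits that can cap thread creation. The helper names `read_first_line` and `thread_limits` are hypothetical, and the cgroup path assumes a cgroup v2 layout. None of these limits is confirmed as the cause in this thread, so treat the output only as a starting point.

```python
from pathlib import Path

try:
    import resource  # Unix only
except ImportError:
    resource = None


def read_first_line(path: str):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def thread_limits() -> dict:
    """Collect a few limits that can cap thread creation on Linux."""
    limits = {
        # System-wide maximum number of threads.
        "kernel.threads-max": read_first_line("/proc/sys/kernel/threads-max"),
        # cgroup v2 task limit for the current cgroup ("max" means unlimited).
        # Assumed path; it is absent on cgroup v1 hosts.
        "cgroup pids.max": read_first_line("/sys/fs/cgroup/pids.max"),
    }
    if resource is not None and hasattr(resource, "RLIMIT_NPROC"):
        # Per-user limit on processes and threads (what `ulimit -u` reports).
        limits["RLIMIT_NPROC (soft, hard)"] = resource.getrlimit(resource.RLIMIT_NPROC)
    return limits


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
```

Running it both on the host and inside the container may show whether the container sees tighter limits than the host does.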

nguyenvulong commented 2 weeks ago

Thank you for your time. The previous issue disappeared when using the privileged flag, but now it gets stuck when writing the output file:

[2024-07-17 05:15:33.797851] [LOG] Converting input file to images...
[2024-07-17 05:15:35.053642] [LOG] Checking blank pages
[2024-07-17 05:15:35.554620] [LOG] Starting OCR with tesseract...
[2024-07-17 05:15:40.063411] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2024-07-17 05:15:43.068226] [LOG] OCR completed
[2024-07-17 05:15:43.069049] [DEBUG] We have 1 ocr'ed files
[2024-07-17 05:15:43.076980] [DEBUG] Joined ocr'ed PDF files
[2024-07-17 05:15:43.077054] [DEBUG] Merging with OCR
[2024-07-17 05:15:43.134226] [DEBUG] Autorotate skipped
[2024-07-17 05:15:43.134368] [DEBUG] Editing producer
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 733, in ocr
    self.edit_producer()
  File "/usr/local/bin/pdf2pdfocr.py", line 1370, in edit_producer
    with open(self.output_file, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/home/docker/indy-OCR.pdf'

Actually, I am more curious whether the problem I had when running the toy example is specific to my case (a limited number of allowed threads on my machine) or a common issue that others here have also encountered. I also mentioned the relative path problem in my previous comment; you may want to check it out, just in case.

LeoFCardoso commented 2 weeks ago

Hi @nguyenvulong

It looks like write permission is missing on your working directory on the host OS (please note the use of "pwd" on the command line). Please see https://docs.docker.com/storage/bind-mounts/#choose-the--v-or---mount-flag

In your test case, the working dir $(pwd) is mapped to "/home/docker". You must have write permission there to generate the output file.
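
A quick way to check the write permission is sketched below. The function name `can_write_output` is hypothetical. Note that the process inside the container may run under a different UID than your host user, so running this on the host only checks the host side.

```python
import os
import tempfile


def can_write_output(directory: str) -> bool:
    """Return True if a file can actually be created in `directory`."""
    try:
        # Creating a real temporary file is a stronger check than os.access(),
        # which can report success even when a later write fails.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    here = os.getcwd()
    print(f"uid={os.getuid()} gid={os.getgid()}")
    print(f"can write to {here}: {can_write_output(here)}")
```

If this reports that the directory is not writable for the user the container runs as, the PermissionError on '/home/docker/indy-OCR.pdf' is the expected result.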

I don't know if the thread issue is a common problem. You are the first to report it. :(

About the relative paths: it looks like ".." can't be used to navigate outside the folder mounted with Docker's "-v" flag, but relative paths starting in the current folder "." should work.
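
The point about ".." can be illustrated with a small path check. This is only a sketch: it assumes the container resolves the input path from the mount point "/home/docker", and `inside_mount` is a hypothetical helper, not part of pdf2pdfocr.

```python
import posixpath


def inside_mount(relative_path: str, mount_point: str = "/home/docker") -> bool:
    """Return True if `relative_path`, resolved from `mount_point`, stays inside it."""
    resolved = posixpath.normpath(posixpath.join(mount_point, relative_path))
    return posixpath.commonpath([mount_point, resolved]) == mount_point


print(inside_mount("./inby.pdf"))             # path from the first command
print(inside_mount("../input_pdf/indy.pdf"))  # path reported as "not found"
```

The first path stays inside the mounted folder and the second one leaves it, so the second file is not visible from inside the container. Under the same assumption, mounting the parent folder, or running the command from the folder that contains the PDF, would keep the input inside the mount.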

nguyenvulong commented 1 week ago

Thank you Leo, I will keep an eye on this. Will reopen the issue if needed. Good day!