Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files
GNU General Public License v3.0

Cleanup Scans / OCR error #1329

Open marcofenoglio opened 1 month ago

marcofenoglio commented 1 month ago

Docker version 0.25.0 on Ubuntu 16.04 LTS

Error log:

08:18:50.766 [qtp2053647669-44] INFO  s.s.SPDF.utils.ProcessExecutor - Running command: ocrmypdf --verbose 2 --output-type pdf --pdf-renderer hocr --deskew --clean --skip-text --language eng /tmp/input_5947355544812649658.pdf /tmp/output_7993803185626034787.pdf
08:18:53.263 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf - ocrmypdf 16.1.1
08:18:53.263 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['unpaper', '--version']
08:18:53.976 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Found unpaper 7.0.0
08:18:53.976 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
08:18:54.015 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
08:18:54.015 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
08:18:54.023 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
08:18:54.084 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Found gs 10.2.1
08:18:54.084 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
08:18:54.099 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
08:18:54.134 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
08:18:54.135 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - [DS] Device[1] 0:(null) score is 0.261972
08:18:54.139 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - [DS] Selected Device[1]: "(null)" (Native)
08:18:54.139 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - List of available languages in "/usr/share/tessdata/" (1):
08:18:54.139 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - eng
08:18:54.139 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   DEBUG ocrmypdf.helpers - pikepdf mmap enabled
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   ERROR ocrmypdf._pipelines._common - ExitCodeException
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - Traceback (most recent call last):
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -     return fn(options, plugin_manager)
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
08:18:54.140 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 166, in _run_pipeline
08:18:54.141 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -     check_requested_output_file(options)
08:18:54.141 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -   File "/usr/lib/python3.12/site-packages/ocrmypdf/_validation.py", line 310, in check_requested_output_file
08:18:54.141 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor -     raise OutputFileAccessError(
08:18:54.141 [Thread-15] INFO  s.s.SPDF.utils.ProcessExecutor - ocrmypdf.exceptions.OutputFileAccessError: Output file location (/tmp/output_7993803185626034787.pdf) is not a writable file.
08:18:54.243 [qtp2053647669-44] WARN  o.e.j.ee10.servlet.ServletChannel - handleException /api/v1/misc/ocr-pdf java.io.IOException: Command process failed with exit code 5. Error message:   DEBUG ocrmypdf - ocrmypdf 16.1.1
  DEBUG ocrmypdf.subprocess - Running: ['unpaper', '--version']
  DEBUG ocrmypdf.subprocess - Found unpaper 7.0.0
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Found tesseract 5.3.4
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Found gs 10.2.1
  DEBUG ocrmypdf.subprocess - Running: ['gs', '--version']
  DEBUG ocrmypdf.subprocess - Running: ['tesseract', '--list-langs']
  DEBUG ocrmypdf.subprocess.tesseract - stdout/stderr = [DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.261972
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (1):
eng

  DEBUG ocrmypdf.helpers - pikepdf mmap enabled
  ERROR ocrmypdf._pipelines._common - ExitCodeException
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 166, in _run_pipeline
    check_requested_output_file(options)
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_validation.py", line 310, in check_requested_output_file
    raise OutputFileAccessError(
ocrmypdf.exceptions.OutputFileAccessError: Output file location (/tmp/output_7993803185626034787.pdf) is not a writable file.
Frooodle commented 1 month ago

Based on the write error, I assume it's a permission issue. Are you running it as a specific user, or with a custom UID/GID, that might be causing problems?

Frooodle commented 1 month ago

Or a volume mapping over /tmp could be causing the issue.
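
To narrow down whether this is a permission problem, a small probe can be run as the same user that runs the OCR step. The following is a minimal sketch, not ocrmypdf's actual validation code: the function name `diagnose_writable` and the probe path in the usage note are made up for illustration. It compares what `os.access` predicts with what a real `open()` does, and reports the UID/GID of the current process.

```python
import os
from pathlib import Path


def diagnose_writable(path_str):
    """Report whether the current process can write to path_str, and why not."""
    path = Path(path_str)
    parent = path.parent
    existed_before = path.exists()
    report = {
        "uid": os.getuid(),
        "gid": os.getgid(),
        "exists": existed_before,
        "parent_exists": parent.is_dir(),
        # os.access only predicts the outcome; the open() below is the real test.
        "access_says_writable": os.access(
            path if existed_before else parent, os.W_OK
        ),
    }
    try:
        # Append mode, so an existing file is not truncated by the probe.
        with open(path, "ab"):
            pass
        report["open_succeeds"] = True
    except OSError as exc:
        report["open_succeeds"] = False
        report["open_error"] = str(exc)
    # Remove the probe file only if this function created it.
    if not existed_before and report["open_succeeds"]:
        path.unlink()
    return report
```

Calling, for example, `diagnose_writable("/tmp/probe_output.pdf")` inside the container returns a dictionary. If `open_succeeds` is true while `access_says_writable` is false, the file itself is writable and the problem lies in how the access check is answered, not in the file permissions.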

marcofenoglio commented 4 weeks ago

This is my docker command:

docker run -d  \
  -p 9284:4443 \
  -e DOCKER_ENABLE_SECURITY=false \
  -e INSTALL_BOOK_AND_ADVANCED_HTML_OPS=false \
  -e LANGS=it_IT,en_GB \
  -v /mnt/data/docker_data/stirling-pdf/trainingData:/usr/share/tessdata \
  -v /mnt/data/docker_data/stirling-pdf/extraConfigs:/configs \
  -v /mnt/data/docker_data/stirling-pdf/logs:/logs \
  -v /etc/letsencrypt/live/<mydomain>/cert.pem:/configs/cert.pem \
  -v /etc/letsencrypt/live/<mydomain>/privkey.pem:/configs/privkey.pem \
  --name stirling-pdf \
  --restart unless-stopped \
  frooodle/s-pdf:latest

GID and UID are the defaults. This is the volume directory structure:

stirling-pdf/:
total 12
drwxr-xr-x 2 1000 1000 4096 May 29 11:15 extraConfigs
drwxr-xr-x 2 1000 1000 4096 Jun  2 22:19 logs
drwxr-xr-x 4 1000 1000 4096 May 29 09:52 trainingData

stirling-pdf/extraConfigs:
total 8
-rwxr-xr-x 1 1000 1000    0 May 29 10:22 cert.pem
-rwxr-xr-x 1 1000 1000  155 May 29 10:35 custom_settings.yml
-rwxr-xr-x 1 1000 1000    0 May 29 10:22 privkey.pem
-rwxr-xr-x 1 1000 1000 3633 Jun  2 22:19 settings.yml

stirling-pdf/logs:
total 8
-rw-r--r-- 1 1000 1000 4103 Jun  2 22:19 info.log
-rwxr-xr-x 1 1000 1000    0 May 29 09:52 invalid-auths.log

stirling-pdf/trainingData:
total 22932
drwxr-xr-x 2 1000 1000     4096 May 29 09:52 configs
-rwxr-xr-x 1 1000 1000 23466654 May 29 09:52 eng.traineddata
-rwxr-xr-x 1 1000 1000      572 May 29 09:52 pdf.ttf
drwxr-xr-x 2 1000 1000     4096 May 29 09:52 tessconfigs
Frooodle commented 4 weeks ago

Looks like a bug with the "Correct pages were scanned at a skewed angle by rotating them back into place" feature, if you have that enabled.

marcofenoglio commented 4 weeks ago

Same error without the "deskew" option.