R0Wi-DEV / workflow_ocr

This is a Nextcloud Workflow App which enables you to process files via OCR on serverside.
GNU Affero General Public License v3.0

OCRmyPDF did not produce any output for image file #216

Closed: ostasevych closed this issue 1 year ago

ostasevych commented 1 year ago

Describe the bug

The script is not working and produces the error: OCR for file /username/files/Documents/289605911_567171248133643_8374688567429083270_n.jpg not possible. Message: OCRmyPDF did not produce any output

System

To Reproduce

Steps to reproduce the behavior:

  1. Create a workflow that runs OCR when a file is assigned the tag to_ocr.
  2. Upload a file and assign it the tag to_ocr.
  3. Run sudo -u www-data php cron.php to force the cron job.
  4. Observe that no new file is created.
  5. Go to the journal and observe the error: OCR for file /username/files/Documents/289605911_567171248133643_8374688567429083270_n.jpg not possible. Message: OCRmyPDF did not produce any output

Screenshots

[Two screenshots attached to the original issue; not reproduced here.]

Server log

/username/files/Documents/289605911_567171248133643_8374688567429083270_n.jpg not possible. Message: OCRmyPDF did not produce any output

R0Wi commented 1 year ago

Thanks for reporting. This is most likely a problem with OCRmyPDF itself. You could set your loglevel to 0 and reproduce the error to get additional logs. Or you could run an ocrmypdf command directly on your backend system to see why it is complaining:

ocrmypdf input.jpg output.pdf

ostasevych commented 1 year ago

> Thanks for reporting. This is most likely a problem with OCRmyPDF itself. You could set your loglevel to 0 and reproduce the error to get additional logs. Or you could run an ocrmypdf command directly on your backend system to see why it is complaining:
>
> ocrmypdf input.jpg output.pdf

After several tests I found that the problem is the Remove background switch. If I turn it off, everything works fine.

R0Wi commented 1 year ago

I would expect OCRmyPDF to print some error message in that case, right? If so, it should also show up in the logs as a WARNING just before the line you mentioned here.

ostasevych commented 1 year ago

> I would expect OCRmyPDF to print some error message in that case, right? If so, it should also show up in the logs as a WARNING just before the line you mentioned here.

Actually, that line is all I got. Running ocrmypdf from the CLI did not produce any error.

R0Wi commented 1 year ago

And what if you run it with the --remove-background flag?

ostasevych commented 1 year ago

> And what if you run it with the --remove-background flag?

Yup!

ocrmypdf -l ukr --remove-background --force-ocr "test_pdf_not_ocr.pdf" "test_pdf_ocr.pdf"
Scanning contents: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.15page/s]
    1 page already has text! - rasterizing text and running OCR anyway
OCR:   0%|                                                                                  | 0.0/1.0 [00:07<?, ?page/s]
An exception occurred while executing the pipeline
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 192, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 126, in make_intermediate_images
    ocr_image = preprocess_out = preprocess(
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 104, in preprocess
    image = preprocess_remove_background(image, page_context)
  File "/usr/lib/python3/dist-packages/ocrmypdf/_pipeline.py", line 469, in preprocess_remove_background
    raise NotImplementedError("--remove-background is temporarily not implemented")
NotImplementedError: --remove-background is temporarily not implemented
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 385, in run_pipeline
    exec_concurrent(context, executor)
  File "/usr/lib/python3/dist-packages/ocrmypdf/_sync.py", line 274, in exec_concurrent
    executor(
  File "/usr/lib/python3/dist-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/usr/lib/python3/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 135, in _execute
    result = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
NotImplementedError: --remove-background is temporarily not implemented

My suggestion is to hide that switch.
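
For anyone who wants to check for this condition without reading the whole traceback, here is a minimal Python sketch. It runs ocrmypdf with --remove-background and looks for the exact message shown in the traceback above. The function names and the choice to merge stdout and stderr are illustrative assumptions, not part of workflow_ocr or OCRmyPDF.

```python
import subprocess
from typing import List, Tuple

# Message text copied verbatim from the traceback in this thread.
NOT_IMPLEMENTED_MARKER = "--remove-background is temporarily not implemented"


def remove_background_unavailable(output: str) -> bool:
    """Return True if captured ocrmypdf output contains the marker message."""
    return NOT_IMPLEMENTED_MARKER in output


def build_command(input_path: str, output_path: str) -> List[str]:
    """Build the ocrmypdf call with the flag under test (hypothetical helper)."""
    return ["ocrmypdf", "--remove-background", input_path, output_path]


def run_ocrmypdf(input_path: str, output_path: str) -> Tuple[int, str]:
    """Run ocrmypdf and return (exit_code, combined stdout and stderr)."""
    proc = subprocess.run(
        build_command(input_path, output_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so the marker is found either way
        text=True,
    )
    return proc.returncode, proc.stdout
```

Usage would look like `code, out = run_ocrmypdf("input.jpg", "output.pdf")` followed by `remove_background_unavailable(out)`. Matching on the message text is brittle and would need updating if the wording changes in another release.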

R0Wi commented 1 year ago

The --remove-background flag seems to be deactivated only temporarily, so we can expect it to be re-added in a future OCRmyPDF release. Since it still works in older releases, removing the switch completely doesn't seem to be a solution (see also my comment here).

Also, I think it's not worth the effort to try to detect the installed OCRmyPDF version, since the flag will hopefully be re-added in the future. I would add a note to the README for documentation purposes, which I think should be enough for the moment. If anyone wants to bring in a PR that checks the OCRmyPDF version and hides the Remove background switch accordingly, I'd be happy to merge it.
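
As a rough illustration of what such a version check could look like, here is a minimal sketch in Python (the app itself is PHP, so this only shows the logic). It parses the text printed by `ocrmypdf --version` and compares it with a threshold. The function names are hypothetical, and the threshold is deliberately a parameter: this thread does not state which release disabled the flag, so the value must be taken from the OCRmyPDF release notes.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Optional[Version]:
    """Extract the first MAJOR.MINOR.PATCH triple from version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def show_remove_background_switch(version_output: str, disabled_since: Version) -> bool:
    """Decide whether the Remove background switch should be shown.

    disabled_since is an assumption supplied by the caller: the first
    release in which --remove-background raises NotImplementedError.
    """
    version = parse_version(version_output)
    if version is None:
        # Unknown version: keep the switch visible, as the app does today.
        return True
    return version < disabled_since
```

A caller would pass the captured output of `ocrmypdf --version` together with the threshold it has verified. If a later release re-adds the flag, the check would also need an upper bound.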

R0Wi commented 1 year ago

Warning added to README. Closing this for now.