Closed kevinkuan1969 closed 1 year ago
Hi, please follow the guide at https://github.com/R0Wi-DEV/workflow_ocr#troubleshooting. After you have decreased the log level, try to trigger the OCR workflow by uploading a new file. Right after that, check the oc_jobs table; it should contain a job for the asynchronous OCR process. If it contains the job, please execute your cron.php. This should trigger the actual ocrmypdf process.
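The oc_jobs check described above can be sketched roughly as follows. This is only an illustration under assumptions: it expects a Python DB-API connection to the Nextcloud database, assumes the table has `id` and `class` columns, and uses an in-memory SQLite database with a placeholder class name as a stand-in for the real server. The helper name `find_ocr_jobs` is made up for this example.

```python
import sqlite3


def find_ocr_jobs(conn, needle="ProcessFileJob"):
    """Return (id, class) rows from oc_jobs whose class name contains `needle`.

    Assumes the default `oc_` table prefix and columns named `id` and `class`.
    Filtering happens in Python so the query works with any DB-API driver.
    """
    cur = conn.cursor()
    cur.execute("SELECT id, class FROM oc_jobs")
    return [(job_id, cls) for job_id, cls in cur.fetchall() if needle in cls]


if __name__ == "__main__":
    # Stand-in database for illustration only; a real check would connect
    # to the Nextcloud database instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE oc_jobs (id INTEGER PRIMARY KEY, class TEXT)")
    # Placeholder class name, not taken from the app's source code.
    conn.execute(
        "INSERT INTO oc_jobs (class) VALUES (?)",
        ("Some\\Namespace\\ProcessFileJob",),
    )
    print(find_ocr_jobs(conn))
```

An empty list right after uploading a file would suggest the workflow did not queue a job at all.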
You can check the PDF itself by having a look at the NC versions: you should see that the OCR process has created another file version (see https://github.com/R0Wi-DEV/workflow_ocr#troubleshooting). You can also check the PDF metadata or properties. If the file has been processed with the OCR tool, you will see that the "creator" is set to ocrmypdf.
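For a quick scripted version of the "creator" check, here is a rough heuristic using only the Python standard library. It scans the raw PDF bytes for an uncompressed `/Creator` entry or an XMP `xmp:CreatorTool` element. It will miss metadata stored in compressed object streams or in hex or UTF-16 strings, so a negative result proves nothing. The function name and the sample bytes are made up for illustration; a PDF viewer's properties dialog remains the more reliable check.

```python
import re


def looks_ocrmypdf_processed(pdf_bytes):
    """Heuristic: True if an uncompressed creator entry mentions ocrmypdf."""
    patterns = [
        rb"/Creator\s*\(([^)]*)\)",                      # document info dictionary
        rb"<xmp:CreatorTool>([^<]*)</xmp:CreatorTool>",  # XMP metadata packet
    ]
    for pattern in patterns:
        for match in re.finditer(pattern, pdf_bytes):
            if b"ocrmypdf" in match.group(1).lower():
                return True
    return False


if __name__ == "__main__":
    # Made-up fragment standing in for a real file's bytes; for a real file,
    # pass the result of open(path, "rb").read() instead.
    sample = b"1 0 obj << /Creator (ocrmypdf) >> endobj"
    print(looks_ocrmypdf_processed(sample))
```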
If you can, please share your server logs here after you have completed the whole process.
Hope this helps!
Thanks @R0Wi for your prompt reply, but the problems below are still the same.
Here is the log file, filtered by "WorkflowOcr".
Here is the sample PDF, my country ID card. I am trying to OCR all the text on the card.
OK, I think I now understand your problem. It seems that the NC OCR process is working fine but your PDF result is not as expected, right?
If you have a look at your server logs, you will see one line from which you can extract the actual ocrmypdf command:
Running command: ocrmypdf -q --force-ocr -l eng+chi_sim+chi_tra -j 2 --sidecar /tmp/oc_tmp_FG9iP0-.sidecar - - | cat
So I tried to run this directly on my command line without the -q flag to get some warning messages and stuff. I got the following:
docker@9b4c2579cc58:/var/www/html/nextcloud$ ocrmypdf --force-ocr --sidecar sidecar.txt myAI-ID.pdf out.pdf
Scan: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 146.52page/s]
INFO - Using Tesseract OpenMP thread limit 3
INFO - 1: page already has text! - rasterizing text and running OCR anyway
OCR: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 1.31page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
INFO - Optimize ratio: 1.00 savings: 0.0%
INFO - Output file is a PDF/A-2B (as expected)
docker@9b4c2579cc58:/var/www/html/nextcloud$ cat sidecar.txt
Identification
Document — myAl
Cloud
So as you can see, the text inside the identity card is indeed not processed by ocrmypdf. Unfortunately I don't know why this doesn't work, but it's definitely a problem with ocrmypdf rather than with this NC app. So I'd suggest you reach out to the guys from ocrmypdf directly and ask for help. Feel free to share the info here.
Also, I'd maybe try some other PDF documents. Maybe there's just something wrong with this one, I don't know.
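For anyone repeating this check, pulling the command out of the "Running command:" log line and dropping -q can be sketched like this. It is a minimal illustration, not part of the app: the function name is made up, and cutting at the first "|" is a naive way to drop the trailing "| cat" that would break if a quoted argument contained a pipe.

```python
import re
import shlex


def command_from_log(message, drop=("-q",)):
    """Extract the argument list from a 'Running command: ...' log message.

    Returns None if the message has no such marker. Flags listed in `drop`
    are removed so the tool prints its warnings again.
    """
    match = re.search(r"Running command:\s*(.+)", message)
    if match is None:
        return None
    command = match.group(1).split("|", 1)[0]  # naive: drop the "| cat" part
    return [arg for arg in shlex.split(command) if arg not in drop]


if __name__ == "__main__":
    line = (
        "Running command: ocrmypdf -q --force-ocr -l eng+chi_sim+chi_tra "
        "-j 2 --sidecar /tmp/oc_tmp_FG9iP0-.sidecar - - | cat"
    )
    print(command_from_log(line))
```

The two trailing `-` arguments mean stdin and stdout, so a manual run replaces them with an input and an output path, as in the transcript above.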
Closing as this is not a problem of the app itself
I am new to Workflow OCR and have read through the README as well as most of the issues, but I still cannot generate any PDF output even though the OCR is done.
My current environment is as follows:
Here my workflow settings:
The result:
The OCRed file: https://demo.analytic360.biz/s/2crk6YjC3n3m6SW. Even though the OCR is successful, I still cannot search for any word inside the document.
By the way, how can I check what content has been successfully OCRed? So far Elasticsearch finds nothing for keywords from this document.
I also discovered that the oc_jobs table in the database does not contain any ProcessFileJob task, yet the OCR is done. Thanks in advance for the help.
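On the question of checking what was recognized: one possible approach, assuming ocrmypdf is run by hand with --sidecar as in the transcript elsewhere in this thread, is to compare the sidecar text file against words you expect to find on the page. The sketch below is only an illustration; the function name and the expected words are made up, and the sample text reuses two lines of the sidecar output shown in the thread.

```python
def missing_words(sidecar_text, expected):
    """Return the expected words that do not appear in the sidecar text.

    The comparison is case-insensitive and uses plain substring matching,
    so it cannot tell a whole word from part of a longer one.
    """
    haystack = sidecar_text.casefold()
    return [word for word in expected if word.casefold() not in haystack]


if __name__ == "__main__":
    # Two lines from the sidecar output in this thread; for a real run,
    # read the file with open("sidecar.txt", encoding="utf-8").read().
    sidecar = "Identification\nCloud\n"
    # "Name" is a hypothetical word one might expect on an ID card.
    print(missing_words(sidecar, ["Identification", "Cloud", "Name"]))
```

Any word reported as missing was not recognized by the OCR run, so no search index built from that text could find it either.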