OCR Successfully Done but no output PDF generated and ProcessFileJob cannot be found in OC_JOBS

kevinkuan1969 commented 1 year ago

I am new to Workflow OCR and have read through the README as well as most of the issues, but I still cannot generate any PDF output even though the OCR is done.

My current environment is as follows:

Nextcloud 26.0.2
Workflow OCR 1.26.1

Here are my workflow settings:

The result:

The OCRed file: https://demo.analytic360.biz/s/2crk6YjC3n3m6SW Even though the OCR is reported as successful, I still cannot search for any word inside the document.

By the way, how can I check which content has been successfully OCRed? So far Elasticsearch finds nothing when I search for keywords from this document.

I also discovered that the OC_JOBS table in the database does not contain any ProcessFileJob task, yet the OCR is done. Thanks in advance for the help.

R0Wi commented 1 year ago

Hi, please follow the guide at https://github.com/R0Wi-DEV/workflow_ocr#troubleshooting. After you have lowered the log level, try to trigger the OCR workflow by uploading a new file. Right after that, check the oc_jobs table; it should contain a job for the asynchronous OCR process. If it does, please execute your cron.php. This should trigger the actual ocrmypdf process.
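The oc_jobs check can be done with a query filtered on the job class. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for the real Nextcloud database, assumes the default `oc_` table prefix, and models only the `id`, `class` and `argument` columns. `pending_ocr_jobs` and `demo_connection` are hypothetical helper names, and the demo rows are made up.

```python
import sqlite3

# Job class name as quoted in this thread.
JOB_CLASS = r"OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob"


def pending_ocr_jobs(conn):
    """Return (id, argument) for every queued Workflow OCR job."""
    cur = conn.execute(
        "SELECT id, argument FROM oc_jobs WHERE class = ? ORDER BY id",
        (JOB_CLASS,),
    )
    return cur.fetchall()


def demo_connection():
    """Build a simplified stand-in for oc_jobs (assumption: reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE oc_jobs (id INTEGER PRIMARY KEY, class TEXT, argument TEXT)"
    )
    conn.executemany(
        "INSERT INTO oc_jobs (id, class, argument) VALUES (?, ?, ?)",
        [
            (1, r"Example\OtherJob", "null"),             # unrelated, made-up job
            (2, JOB_CLASS, '{"hypothetical": "value"}'),  # made-up argument
        ],
    )
    return conn


if __name__ == "__main__":
    for job_id, argument in pending_ocr_jobs(demo_connection()):
        print(job_id, argument)
```

Against a real installation the same `SELECT` would be run with your database's own client. One-off queued jobs are normally removed from the table once cron.php has executed them, so an empty result after the OCR has already finished is not by itself an error.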

You can check the PDF itself by looking at the NC file versions; you should see that the OCR process has created another file version (see https://github.com/R0Wi-DEV/workflow_ocr#troubleshooting). You can also check the PDF metadata or properties. You will see that the "creator" is set to ocrmypdf if the file has been processed with the OCR tool.
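If the `pdfinfo` tool from poppler-utils happens to be installed, the metadata check can also be scripted. This is a minimal sketch under that assumption: `parse_pdfinfo`, `creator_is_ocrmypdf` and `pdfinfo_fields` are hypothetical helper names, and the parser only relies on `pdfinfo` printing one `Key: value` pair per line.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key:   value" lines into a dict of strings."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def creator_is_ocrmypdf(fields):
    """True if the Creator field mentions ocrmypdf (case-insensitive)."""
    return "ocrmypdf" in fields.get("Creator", "").lower()


def pdfinfo_fields(path):
    """Run pdfinfo on a file; raises if the tool is missing or fails."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

A call such as `creator_is_ocrmypdf(pdfinfo_fields("out.pdf"))` would then report whether the downloaded file carries the ocrmypdf creator entry.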

If you can, please share your server logs here after you completed the whole process.

Hope this helps!

kevinkuan1969 commented 1 year ago

Thanks @R0Wi for your prompt reply, but the problems below are still the same.

OCR is done, but no OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob can be found in the log or in OC_JOBS.
The OCR is only partially done: the text in the picture is not extracted.

Here is the log file, filtered by "WorkflowOcr":

nextcloud.log

Here is the sample PDF of my country ID; I am trying to OCR all the text inside the card.

myAI-ID.pdf

R0Wi commented 1 year ago

OK, I think I understand your problem now. It seems that the NC OCR process is working fine, but your PDF result is not as expected, right?

If you have a look at your server logs, you will see one line from which you can extract the actual ocrmypdf command:

Running command: ocrmypdf -q --force-ocr -l eng+chi_sim+chi_tra -j 2 --sidecar /tmp/oc_tmp_FG9iP0-.sidecar - - | cat
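Pulling that line out of nextcloud.log can be automated. The sketch below assumes the usual Nextcloud log layout of one JSON object per line with a `message` field, and that the message contains the text `Running command: ` as quoted above. `logged_commands` and `verbose_variant` are hypothetical helper names.

```python
import json
import shlex

MARKER = "Running command: "


def logged_commands(log_lines):
    """Collect every command that follows MARKER in a JSON log line."""
    commands = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        _, found, command = str(entry.get("message", "")).partition(MARKER)
        if found:
            commands.append(command.strip())
    return commands


def verbose_variant(command):
    """Drop the -q flag and any trailing pipeline so warnings become visible."""
    tokens = shlex.split(command)
    if "|" in tokens:
        tokens = tokens[: tokens.index("|")]
    return shlex.join(token for token in tokens if token != "-q")


if __name__ == "__main__":
    with open("nextcloud.log", encoding="utf-8") as log:
        for command in logged_commands(log):
            print(verbose_variant(command))
```

The printed command still reads from stdin and writes to stdout (`- -`); for a manual run those two arguments would be replaced by an input and an output file name.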

So I ran this command directly on my command line without the -q flag to get warning messages and other output. I got the following:

docker@9b4c2579cc58:/var/www/html/nextcloud$ ocrmypdf --force-ocr --sidecar sidecar.txt  myAI-ID.pdf out.pdf
Scan: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 146.52page/s]
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: page already has text! - rasterizing text and running OCR anyway                                                                                    
OCR: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  1.31page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)

docker@9b4c2579cc58:/var/www/html/nextcloud$ cat sidecar.txt 
Identification
Document — myAl
Cloud

So as you can see, the text inside the identity card is indeed not processed by ocrmypdf. Unfortunately I don't know why this doesn't work, but it's definitely a problem with ocrmypdf rather than with this NC app. So I'd suggest you reach out to the ocrmypdf maintainers directly and ask for help. Feel free to share what you find here.

I'd also try some other PDF documents; maybe there is just something wrong with this one, I don't know.
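When trying other documents, the `--sidecar` text file is a convenient way to see what was actually recognized, which also answers the earlier question about checking OCR content. The sketch below compares the sidecar text against phrases you expect to find. `missing_phrases` is a hypothetical helper, and it uses a plain case-insensitive substring match, which is an assumption that keeps it usable for Chinese text but can produce false positives on partial words.

```python
def normalize(text):
    """Collapse whitespace and ignore case so line breaks do not matter."""
    return " ".join(text.split()).casefold()


def missing_phrases(sidecar_text, expected):
    """Return the expected phrases that do not occur in the sidecar text."""
    haystack = normalize(sidecar_text)
    return [phrase for phrase in expected if normalize(phrase) not in haystack]


if __name__ == "__main__":
    # sidecar.txt is the file written by ocrmypdf's --sidecar option.
    with open("sidecar.txt", encoding="utf-8") as sidecar:
        text = sidecar.read()
    # Replace these with words you know are printed on the document.
    print(missing_phrases(text, ["Identification", "Cloud"]))
```

An empty list means every expected phrase was found; anything returned is text that ocrmypdf did not extract from the document.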

R0Wi commented 1 year ago

Closing, as this is not a problem with the app itself.
