deranjer / goEDMS

golang/react EDMS for home users
MIT License

ImageMagick and Tesseract failures #4

Open Penquincoder opened 4 years ago

Penquincoder commented 4 years ago

Just set this up and ran into some bugs while trying to import PDF files:

System: CentOS 7 3.10.0-1062.9.1.el7.x86_64
Method: Download release goEDMS_0.1.8_Linux_x86_64.tar.gz
SESTATUS: permissive

Modified serverConfig.toml with the correct paths to convert and tesseract:

$ which tesseract
/bin/tesseract

$ which convert
/bin/convert

[ingress]
    IngressPath = 'staging'

[ocr]
    TesseractBin = "/bin/tesseract"
    MagickBin = "/bin/convert"   

Copied existing PDFs to /opt/goEDMS/staging and received the following errors in goedms.log for ALL PDFs to ingest:

{"level":"info","time":"2020-02-02T23:02:09-06:00","message":"Converting PDF To image for OCR/opt/goEDMS/staging/bill.pdf"}
{"level":"info","time":"2020-02-02T23:02:09-06:00","message":"Creating temp image for OCR at: /opt/goEDMS/temp/bill.png"}
{"level":"error","time":"2020-02-02T23:02:09-06:00","message":"Unable to convert PDF Using Magick: /opt/goEDMS/staging/bill.pdfexit status 1"}
{"level":"error","time":"2020-02-02T23:02:09-06:00","message":"OCR Processing failed on file: /opt/goEDMS/staging/bill.pdf: exit status 1"}  

No documents appear in the web-gui.

deranjer commented 4 years ago

Can you install the latest version and try again? I've updated the error reporting to get a much more detailed error response.
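For anyone following along, here is a minimal sketch of what more detailed error reporting for an external tool can look like in Go. It is not goEDMS's actual code: the helper name `runTool` is hypothetical, and the paths in `main` are simply the ones from the report above. The idea is to capture the tool's stderr so the log carries its own message rather than only "exit status 1".

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runTool (hypothetical helper) runs an external binary and returns its
// stdout. On failure the returned error wraps the original error and
// appends whatever the tool wrote to stderr.
func runTool(bin string, args ...string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(bin, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		msg := strings.TrimSpace(stderr.String())
		return stdout.String(), fmt.Errorf("%s failed: %w: %s", bin, err, msg)
	}
	return stdout.String(), nil
}

func main() {
	// Assumed invocation, mirroring the paths in the log above.
	_, err := runTool("/bin/convert",
		"/opt/goEDMS/staging/bill.pdf", "/opt/goEDMS/temp/bill.png")
	if err != nil {
		fmt.Println("Unable to convert PDF using Magick:", err)
	}
}
```

With the stderr text in the error, a failure from convert would show the tool's own complaint in goedms.log instead of a bare exit code.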

Penquincoder commented 4 years ago

Thanks for adding additional debugging. I'm going to say there's a definite bug here, with potential for destruction of data:

Log

{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Starting processing for file: /opt/goEdms/staging/20280201.pdf"}
{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Working on current file: 20280201.pdf"}        
{"level":"info","time":"2020-02-03T19:01:38-06:00","message":"Text processed from PDF without OCR: 20280201.pdf"}           
{"level":"info","time":"2020-02-03T19:01:38-06:00","message":"No record found, assume no duplicate hash: not found"}
{"level":"debug","time":"2020-02-03T19:01:38-06:00","message":"Adding full text for search to bleve: Creation date: 2020-01-28<OCR'd text>"}
{"level":"-","time":"2020-02-03T19:02:38-06:00","message":"wake, now=2020-02-03T19:02:38-06:00"}
{"level":"-","time":"2020-02-03T19:02:38-06:00","message":"run, now=2020-02-03T19:02:38-06:00, entry=1, next=2020-02-03T19:03:38-06:00"}
{"level":"info","time":"2020-02-03T19:02:38-06:00","message":"Starting Ingress Job on folder:/opt/goEdms/staging"}
{"level":"debug","time":"2020-02-03T19:02:38-06:00","message":"Starting processing for file: /opt/goEdms/staging"}
{"level":"warn","time":"2020-02-03T19:02:38-06:00","message":"Unable to get information for file, won't process: /opt/goEdms/staging: stat /opt/goEdms/staging: no such file or directory"}

Based on the error, I checked the /opt/goEdms directory. Sure enough, /staging/ doesn't exist! It appears that goEDMS is deleting the staging directory itself instead of just the processed files.

Changing the IngressDeleteOnProcess option in serverConfig.toml doesn't affect the outcome. The /staging/ directory is still deleted entirely whether the option is set to true or false.

No documents appear in the web gui.

deranjer commented 4 years ago

Okay, let me do some testing and get back to you.

deranjer commented 4 years ago

So for some reason goEDMS is saying that 'staging' is a file, not a folder. I'm not sure why that is. I'm adding a few more logging statements and checks to ensure that the root ingress folder is not deleted. I'll hopefully push a new build today for you to try out.
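A rough sketch of the kind of check described here, assuming nothing about how goEDMS actually implements it: the function name `removeProcessed` and the error value `errRefused` are hypothetical. The guard refuses to delete anything that is the ingress root itself, lies outside it, or is not a regular file.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// errRefused marks deletions blocked by the guard (hypothetical name).
var errRefused = errors.New("refusing to delete")

// removeProcessed deletes path only if it is a regular file strictly
// inside ingressRoot. The root itself and directories are never removed.
func removeProcessed(ingressRoot, path string) error {
	root, err := filepath.Abs(ingressRoot)
	if err != nil {
		return err
	}
	target, err := filepath.Abs(path)
	if err != nil {
		return err
	}
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return err
	}
	// "." means the root itself; a leading ".." means outside the root.
	if rel == "." || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return fmt.Errorf("%w: %s is not a file inside %s", errRefused, target, root)
	}
	info, err := os.Lstat(target)
	if err != nil {
		return err
	}
	if !info.Mode().IsRegular() {
		return fmt.Errorf("%w: %s is not a regular file", errRefused, target)
	}
	return os.Remove(target)
}

func main() {
	// Demonstrate against a throwaway directory rather than a real staging folder.
	root, err := os.MkdirTemp("", "staging")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(root)

	doc := filepath.Join(root, "bill.pdf")
	if err := os.WriteFile(doc, []byte("%PDF-"), 0o644); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("delete root:", removeProcessed(root, root))
	fmt.Println("delete file:", removeProcessed(root, doc))
}
```

Note that a directory walk such as `filepath.Walk` visits the root path as its first entry, which is one way a root folder can end up being handed to per-file processing; whether that is what happened here is not confirmed in this thread.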

deranjer commented 4 years ago

Okay, version 0.2.0 is out. Please try that and let me know.