Error: Permission denied: 'tesseract'

leandrodamascena commented 5 years ago

Hi. First of all, thank you for doing this project; it's very interesting to me. I need to OCR more than 6 TB of PDF files, and using an EC2 instance with only ocrmypdf (the Python project) is too slow and does not perform well.

I followed all the configuration steps explained in README.md, but I'm facing a tesseract error. I enabled X-Ray debugging and extracted the exact error. Below are some screenshots of my configuration. Thank you in advance, and let me know if you need more information.

Error in lambda console

Configuration of environment variables

My ROLE

X-RAY debug

hdwatts commented 5 years ago

Hi, thanks for submitting the issue.

Can you share an example of your event config? Also did you download the zip from the releases page or clone the source and compress it yourself?

leandrodamascena commented 5 years ago

Hi, it's working fine now. The problem was that the files in the bin folder are not executable by default. I downloaded this on a Linux AMI instance and changed the permissions to make them executable. Thank you so much.

The remaining problem is that my PDF file is in Portuguese (Brazil) and the OCR result was not very good. I'll read the code and try to implement multi-language support.
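The permission fix described above can be sketched in Python. This is a minimal illustration, not part of the project: the helper name `make_executable` is hypothetical, and it assumes the binaries sit directly inside a single `bin` directory of the unpacked release. It adds the execute bits to every regular file in that directory and reports which files it changed.

```python
import stat
from pathlib import Path


def make_executable(bin_dir):
    """Add execute permission bits to every regular file in bin_dir.

    Returns the names of the files whose mode was actually changed.
    """
    changed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-file entries
        mode = path.stat().st_mode
        # Grant execute permission to user, group and others.
        new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
        if new_mode != mode:
            path.chmod(new_mode)
            changed.append(path.name)
    return changed
```

The package then has to be re-zipped with a tool that preserves Unix permissions; otherwise the execute bits may be lost again when Lambda unpacks the archive.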

hdwatts commented 5 years ago

Thanks, I'll look into that! I want to note that I just added a release that includes multi-threading. This leads to about a 4x increase in speed (since it runs OCR on 4 pages at a time).

Let me know if downloading the Portuguese .traineddata file from here, placing it into the tessdata folder, and recompressing works.

You will probably have to delete the eng.traineddata file to remain within AWS function size limits.

leandrodamascena commented 5 years ago

I downloaded the new version and some new problems are occurring:

1 - Size limit exceeded: the unzipped size of the file exceeds the Lambda limit (262144000 bytes). I removed the file "tessdata/osd.traineddata" and it worked fine. The execution time is better than in the previous version. I'm concerned about the removed file; I don't really know the consequences of running without it day to day.

2 - When I add por.traineddata and remove the English file, it doesn't work. The system always expects the English language by default.
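A quick way to see how close a package is to the size limit quoted above is to sum the uncompressed sizes recorded in the zip. The sketch below is only an illustration: the function names are hypothetical, and the limit constant is simply the number from the error message in this thread.

```python
import zipfile

# Unzipped size limit in bytes, as quoted in the error message above.
LAMBDA_UNZIPPED_LIMIT = 262144000


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of all archive entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def largest_entries(zip_path, count=5):
    """Return (filename, uncompressed size) for the biggest entries."""
    with zipfile.ZipFile(zip_path) as archive:
        entries = sorted(
            archive.infolist(), key=lambda info: info.file_size, reverse=True
        )
    return [(info.filename, info.file_size) for info in entries[:count]]
```

Comparing `unzipped_size(path)` against `LAMBDA_UNZIPPED_LIMIT` and inspecting `largest_entries(path)` shows which files (for example large `.traineddata` files) are worth removing or replacing first.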

hdwatts commented 5 years ago

Apologies, I included some unnecessary files in the .zip, leading to it being too big. This is fixed if you re-download the release.

In _validations.py on line 61 OCRmyPDF appears to default to eng. I believe if you modify the ocrmypdf call on line 30 in apply-ocr-to-s3-object.py to be something like: ocrmypdf.ocr(inputname, outputname, pages=pages, force_ocr=True, lambda_safe=True, language=['por']) then it should skip the validation for eng and look for por.

As for osd.traineddata, that is Orientation and script detection data, so definitely something that is useful depending on your input. I would keep it in. More information here: https://ai.google/research/pubs/pub35506

leandrodamascena commented 5 years ago

Still not working. Things that I tried:

1 - Opened _validation.py, set "por" directly as the default language, and deleted "eng.traineddata". I got the same error about the language.

2 - Tried keeping "eng.traineddata" inside the folder and removing the "dist-info" directories inside the python directory, and the Lambda size limit was exceeded.

3 - Are you sure that you deleted some files from the repo and committed? I didn't see this commit.

Thank you man.

hdwatts commented 5 years ago

So I didn't make a commit to remove anything from the repo. I only updated the lambda-OCRmyPDF.zip file in the releases section.

I have done a bunch of tinkering and found that the issues stem from the way tesseract is being called for some utility functions. Even just to print parameters it requires eng.traineddata by default!!!
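Since a missing language file only surfaces as a tesseract error at run time, a pre-flight check on the tessdata folder can make the cause more obvious. The following is a minimal sketch under assumptions: the helper name `missing_traineddata` is hypothetical, and it assumes each language is stored as `<lang>.traineddata` directly inside the tessdata directory.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes that have no .traineddata file in tessdata_dir."""
    tessdata = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]
```

For example, calling it with `["eng", "por"]` on a package where eng.traineddata was deleted would return `["eng"]`, which matches the situation described in this thread.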

I have created a hardcoded por language zip file for you. It can be found here: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip

Note that the event must have a language='por' param. As shown below:

{
  "pages": "1",
  "awsRegion": "us-east-1",
  "language": "por",
  "s3": {
    "bucket": {
      "name": "[BUCKETNAME]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
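To make the shape of this event concrete, here is a rough sketch of how a handler might pull the relevant fields out of it. This is not the project's actual code: `parse_event` is a hypothetical helper, the fallback to "eng" is an assumption based on the default mentioned earlier in the thread, and splitting on "+" follows tesseract's convention for combining languages (for example "eng+por").

```python
def parse_event(event, default_language="eng"):
    """Extract the S3 location and OCR options from an event dict.

    default_language is an assumption; the hardcoded por build
    discussed above expects the event to set "language" explicitly.
    """
    s3 = event["s3"]
    language = event.get("language", default_language)
    return {
        "bucket": s3["bucket"]["name"],
        "key": s3["object"]["key"],
        "region": event.get("awsRegion"),
        "pages": event.get("pages"),
        # Tesseract-style "eng+por" strings become a list of codes.
        "languages": language.split("+"),
    }
```

The resulting `languages` list could then be passed as the `language` argument of the OCR call suggested earlier in the thread.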

Note: I have no idea if this works on Portuguese files, please let me know. I have tested on a basic input.pdf file and the lambda function completes without issue, but do not know if the OCR actually works.

If it does work please let me know and I will work on an official multilanguage support release.

leandrodamascena commented 5 years ago

It is now working nicely with the language as a configuration option. But I'm still facing a problem with Portuguese: the OCR is not recognizing the Portuguese words in the PDF.

Do you have an ephemeral container or another environment for testing this code standalone? I could set it up here and test different scenarios.

Thank you, it's a really nice project!

hdwatts commented 5 years ago

That is probably due to the por.traineddata coming from the TessData Fast repository instead of the normal TessData to save on space so it fits in Lambda.

Is it recognizing any Portuguese words or none at all?

And while I do not have an ephemeral container, the library itself has a docker container here: https://ocrmypdf.readthedocs.io/en/latest/docker.html

leandrodamascena commented 5 years ago

I already downloaded the TessData file that is not "compressed", the full-size one.

No, it isn't recognizing any words.

Do you want the original PDF that I'm testing with? I can send it to your email.

hdwatts commented 5 years ago

Sure - send it to the email linked in my github profile.

harnit-bakshi commented 4 years ago

Any update on this? Did it work for POR?

krzischp commented 2 years ago

Hi, do you have any update, please? Your link is broken: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip

chronograph-pe / lambda-OCRmyPDF

Error: Permission denied: 'tesseract' #2