Open leandrodamascena opened 5 years ago
Hi, thanks for submitting the issue.
Can you share an example of your event config? Also did you download the zip from the releases page or clone the source and compress it yourself?
Hi, it is working fine now. The problem was that the files in the bin folder are not executable by default. I downloaded the zip on a Linux AMI instance and changed the permissions to make them executable. Thank you so much.
The problem now is that my PDF file is in Portuguese (Brazil) and the OCR was not very good. I'll read the code and try to implement multi-language support.
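For anyone hitting the same permission problem, here is a minimal sketch of the fix described above, done in Python before re-zipping. It assumes the binaries sit in one flat folder (the `bin` folder of the unpacked release); the function name `make_executable` is only illustrative and not part of the project.

```python
import stat
from pathlib import Path


def make_executable(bin_dir):
    """Add execute permission for user, group and others to every
    regular file directly inside bin_dir. Returns the file names touched."""
    changed = []
    for path in sorted(Path(bin_dir).iterdir()):
        if path.is_file():
            mode = path.stat().st_mode
            # Keep the existing bits and add the three execute bits.
            path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            changed.append(path.name)
    return changed
```

Run it on the unpacked folder on Linux and then compress again, so the zip records the updated permissions.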
Thanks, I'll look into that! I want to note that I just added a release that includes multi-threading. This leads to about a 4x increase in speed (since it runs OCR on 4 pages at a time).
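As a rough illustration of the "4 pages at a time" idea (not the code used in the release), a thread pool with four workers could look like the sketch below. `ocr_page` is a hypothetical callable standing in for whatever does the per-page OCR.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(pages, ocr_page, max_workers=4):
    """Run ocr_page on each page with at most max_workers pages in flight.
    Results come back in the same order as the input pages."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))
```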
Let me know if downloading the Portuguese .traineddata file from here, placing it into the tessdata folder, and recompressing works.
You will probably have to delete the eng.traineddata file to remain within AWS function size limits.
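The steps above could be scripted roughly as follows. This is only a sketch: it assumes the release zip has already been unpacked into `package_dir` with a top-level `tessdata` folder, and the helper name `swap_language_and_zip` is made up for this example.

```python
import shutil
from pathlib import Path


def swap_language_and_zip(package_dir, new_traineddata, zip_base, remove=()):
    """Copy a .traineddata file into <package_dir>/tessdata, delete the
    files named in `remove`, and re-zip the package.
    Returns the path of the archive that was written."""
    tessdata = Path(package_dir) / "tessdata"
    for name in remove:
        # missing_ok needs Python 3.8 or newer.
        (tessdata / name).unlink(missing_ok=True)
    shutil.copy2(new_traineddata, tessdata / Path(new_traineddata).name)
    # zip_base must live outside package_dir, or the archive would include itself.
    return shutil.make_archive(str(zip_base), "zip", root_dir=str(package_dir))
```

For example, `swap_language_and_zip("unpacked", "por.traineddata", "lambda-por", remove=["eng.traineddata"])` would write `lambda-por.zip` next to the script.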
I downloaded the new version and some new problems are occurring:
1 - Size limit exceeded: the unzipped size exceeds the Lambda limit (262144000 bytes). I removed the file "tessdata/osd.traineddata" and it worked fine. Execution time is better than in the previous version. I'm concerned about the removed file, though; I really don't know the consequences of running without it in daily use.
2 - When I add por.traineddata and remove the English file, it doesn't work. The system always expects the English language by default.
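A quick way to see how close an unpacked package is to the limit quoted in the error, and which files take the most space, is sketched below. The 262144000-byte figure is the one from the error message above; the function names are illustrative.

```python
from pathlib import Path

# Unzipped size limit in bytes, as quoted in the Lambda error above.
LAMBDA_UNZIPPED_LIMIT = 262_144_000


def unpacked_size(package_dir):
    """Total size in bytes of all regular files under package_dir."""
    return sum(p.stat().st_size for p in Path(package_dir).rglob("*") if p.is_file())


def largest_files(package_dir, top=5):
    """The `top` largest files as (size_in_bytes, relative_path) pairs."""
    root = Path(package_dir)
    files = [
        (p.stat().st_size, p.relative_to(root).as_posix())
        for p in root.rglob("*")
        if p.is_file()
    ]
    return sorted(files, reverse=True)[:top]


def fits_in_lambda(package_dir, limit=LAMBDA_UNZIPPED_LIMIT):
    """True if the unpacked package is within the limit."""
    return unpacked_size(package_dir) <= limit
```

Calling `largest_files` on the unpacked folder should make it easier to decide what to trim without guessing.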
Apologies, I included some unnecessary files in the .zip, leading to it being too big. This is fixed if you re-download the release.
In _validations.py on line 61 OCRmyPDF appears to default to eng. I believe if you modify the ocrmypdf call on line 30 in apply-ocr-to-s3-object.py to be something like:
ocrmypdf.ocr(inputname, outputname, pages=pages, force_ocr=True, lambda_safe=True, language=['por'])
then it should skip the validation for eng and look for por.
As for osd.traineddata, that is Orientation and script detection data, so definitely something that is useful depending on your input. I would keep it in. More information here: https://ai.google/research/pubs/pub35506
Still not working. Things that I tried:
1 - Opened _validation.py, set "por" directly as the default language, and deleted "eng.traineddata". I got the same error about the language.
2 - Tried keeping "eng.traineddata" inside the folder and removing the "dist-info" directories inside the python directory, but the Lambda size limit was exceeded.
3 - Are you sure that you deleted some files from the repo and committed? I didn't see this commit.
Thank you, man.
So I didn't make a commit to remove anything from the repo. I only updated the lambda-OCRmyPDF.zip file in the releases section.
I have done a bunch of tinkering and found that the issues stem from the way tesseract is being called for some utility functions. Even just to print parameters it requires eng.traineddata by default!!!
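Since a missing .traineddata file only shows up as a tesseract error at run time, a small pre-flight check on the tessdata folder can help when debugging. This is a sketch that only scans file names; it does not call tesseract, and the function names are made up for illustration.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Names of the *.traineddata files in tessdata_dir, without the suffix.
    Note that osd.traineddata shows up as 'osd' even though it is not a language."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(tessdata_dir, required):
    """The entries of `required` that have no .traineddata file in tessdata_dir."""
    have = set(available_languages(tessdata_dir))
    return [lang for lang in required if lang not in have]
```

For instance, `missing_languages("tessdata", ["eng", "por"])` returns whichever of the two files is absent from that folder.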
I have created a hardcoded por language zip file for you. It can be found here: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip
Note that the event must have a language='por' param, as shown below:
{
  "pages": "1",
  "awsRegion": "us-east-1",
  "language": "por",
  "s3": {
    "bucket": {
      "name": "[BUCKETNAME]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
Note: I have no idea if this works on Portuguese files, please let me know. I have tested on a basic input.pdf file and the lambda function completes without issue, but do not know if the OCR actually works.
If it does work please let me know and I will work on an official multilanguage support release.
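For reference, reading an event of the shape shown above might look like the sketch below. This is not the handler shipped in the zip: `parse_event` is a hypothetical helper, and falling back to eng when no language is given is an assumption made only for this example.

```python
def parse_event(event, default_language="eng"):
    """Pull the fields from an event shaped like the example above.
    `language` is returned as a list, matching the language=['por'] form
    used in the ocrmypdf call earlier in this thread."""
    s3 = event["s3"]
    return {
        "bucket": s3["bucket"]["name"],
        "key": s3["object"]["key"],
        "pages": event.get("pages"),
        "region": event.get("awsRegion"),
        # Assumption: default to English when the event has no language field.
        "language": [event.get("language", default_language)],
    }
```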
It is now working nicely with the language as a configuration option, but I'm still facing a problem with Portuguese: the PDF OCR is not recognizing Portuguese words.
Do you have an ephemeral container or another environment where this code can be tested standalone? I could configure it here and test different scenarios.
Thank you, it's a really nice project!
That is probably due to the por.traineddata coming from the TessData Fast repository instead of the normal TessData to save on space so it fits in Lambda.
Is it recognizing any Portuguese words or none at all?
And while I do not have an ephemeral container, the library itself has a docker container here: https://ocrmypdf.readthedocs.io/en/latest/docker.html
I have already downloaded the full-size TessData, not the "compressed" one.
No, it isn't recognizing any words.
Do you want the original PDF that I'm testing with? I can send it to your email.
Sure - send it to the email linked in my github profile.
Any update on this? Did it work for POR?
Hi, do you have any update, please? Your link is broken: https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.1-alpha-por/lambda-por-OCRmyPDF.zip
Hi man. First of all, thank you for doing this project; it's very interesting to me. I now need to OCR more than 6 TB of PDF files, and using an EC2 instance with only ocrmypdf (the Python project) is too slow and does not perform well.
I followed all the configuration steps explained in README.md, but I'm facing a tesseract error. I enabled X-Ray debugging and extracted the exact error. Below I'm sending some screenshots of my configuration. Thank you in advance, and let me know if you need more information.
Error in lambda console
Configuration of environment variables
My ROLE
X-RAY debug