dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Use an external API for OCR, e.g. Amazon Textract or Google Vision #794

Open Bowriverstudio opened 5 years ago

Bowriverstudio commented 5 years ago

Is your feature request related to a problem? Please describe.

Tesseract does not handle the PDFs I'd like to OCR well enough.

Describe the solution you'd like

I want to be able to use an external API such as:

https://aws.amazon.com/textract
https://aws.amazon.com/rekognition/
https://cloud.google.com/vision/docs/ocr
https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text

Describe alternatives you've considered

I am willing to hire a developer to build this feature if it is not already included. If I do hire someone, I'd like to contribute the result back to this community.

dadoonet commented 5 years ago

Thanks for proposing this change. That would require a lot of changes, though, and because we use Tika to do the extraction, I think it would make more sense to add this there. @tballison might be able to tell.

Bowriverstudio commented 5 years ago

Hello and thanks for the suggestion.

I think you are correct, and it seems I'm not the first one to ask about this.

https://stackoverflow.com/questions/51767916/how-to-configure-google-vision-api-with-tika-parser

Looks like I just need to do this: https://cwiki.apache.org/confluence/display/tika/TikaOCR

If you have any additional suggestions please let me know.

Thanks again.