Configurable OCR language

adam-stanek commented 3 years ago

Hello,

I am testing out lodestone. I like how it works with some test data, but sadly Tika does not seem to produce reasonable results for documents in my native language (Czech). I built a custom Docker image with an additional Tesseract language pack (tesseract-ocr-ces), but that did not help on its own. It seems that Tika needs a hint before it starts OCR processing so that it recognizes the characters in the document correctly. I found that it is possible to provide this hint by passing an additional HTTP header with the request (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR). Here is an example:

curl -T somedoc.pdf -H "X-Tika-OCRLanguage: ces" http://127.0.0.1:9998/tika

This header would have to be passed by the lodestone processor when submitting the job (https://github.com/LodestoneHQ/lodestone-processor/blob/master/pkg/processor/document/document.go#L187).

Would you be open to adding the OCR language as a CLI parameter for the processor and propagating it to Tika, or do you have a more complex strategy in mind for handling languages?
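
For illustration, the proposal could look roughly like the Go sketch below. It is not the processor's actual code: the flag names (`-tika-url`, `-ocr-language`) and the helper `newTikaRequest` are hypothetical, and the only thing taken from the Tika documentation is the `X-Tika-OCRLanguage` header from the curl example above. The header is set only when a language is given, so behavior without the flag stays unchanged.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"net/http"
	"os"
)

// ocrLanguageHeader is the header used in the curl example above.
const ocrLanguageHeader = "X-Tika-OCRLanguage"

// newTikaRequest (hypothetical helper) builds the PUT request that
// `curl -T` would send. The OCR language header is added only when
// ocrLang is non-empty.
func newTikaRequest(tikaURL string, doc io.Reader, ocrLang string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, tikaURL, doc)
	if err != nil {
		return nil, err
	}
	if ocrLang != "" {
		req.Header.Set(ocrLanguageHeader, ocrLang)
	}
	return req, nil
}

func main() {
	// Hypothetical flag names, chosen for this sketch only.
	tikaURL := flag.String("tika-url", "http://127.0.0.1:9998/tika", "Tika server endpoint")
	ocrLang := flag.String("ocr-language", "", "Tesseract language code, e.g. ces")
	flag.Parse()

	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: sketch [flags] <document>")
		os.Exit(2)
	}

	f, err := os.Open(flag.Arg(0))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	req, err := newTikaRequest(*tikaURL, f, *ocrLang)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Print the extracted text returned by Tika.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Running it as `go run . -ocr-language ces somedoc.pdf` would send the same request as the curl command above, assuming a Tika server is listening on the default URL.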

dskaggs commented 3 years ago

Hi, thanks for the note! It's something I would be interested in looking into for sure. I just took over the project a few months ago and promptly got busy with school. I'm about 6 weeks from finishing this degree and then I can dedicate quite a bit more time to it. I'm a front-end developer historically so I'm not as familiar with the backend parts yet.

adam-stanek commented 3 years ago

I can give it a shot if you want. I specialise in FE as well, but it shouldn't be hard to implement. I started learning Go for fun projects over the holidays, so it might be good practice ;)

Good luck with your degree! :)

adam-stanek commented 3 years ago

Implemented in the MR.

LodestoneHQ / lodestone

Configurable OCR language #106