Closed by adam-stanek 3 years ago
Hi, thanks for the note! It's something I would be interested in looking into for sure. I just took over the project a few months ago and promptly got busy with school. I'm about 6 weeks from finishing this degree and then I can dedicate quite a bit more time to it. I'm a front-end developer historically so I'm not as familiar with the backend parts yet.
I can give it a shot if you want. I specialise in FE as well, but it shouldn't be hard to implement. I started learning Go for fun projects over the holidays, so it might be good practice ;)
Good luck with your degree! :)
Implemented in the MR.
Hello,
I am testing out lodestone. I like how it works with some test data, but sadly Tika does not seem to produce reasonable results for documents in my native language (Czech). I have built a custom Docker image with an additional Tesseract language pack (tesseract-ocr-ces), but that did not help on its own. It seems that Tika needs a hint before it starts OCR processing so that it recognizes the characters in the document better. I have found that it is possible to give it that hint by passing an additional HTTP header with the request (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR). Here is an example:
This header would have to be passed by the lodestone processor when submitting the job (https://github.com/LodestoneHQ/lodestone-processor/blob/master/pkg/processor/document/document.go#L187).
Would you be open to adding the OCR language as a CLI parameter for the processor and propagating it to Tika, or do you have a more complex strategy in mind for handling languages?
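To make the proposal concrete, this is roughly what the CLI side could look like. It is only a sketch: the flag name -ocr-language and the helper parseOCRLanguage are made up for illustration, and it uses Go's standard flag package, which may differ from whatever CLI library the processor actually uses.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseOCRLanguage reads a hypothetical -ocr-language flag from args and
// returns its value. An empty string means no language was requested.
func parseOCRLanguage(args []string) (string, error) {
	fs := flag.NewFlagSet("processor", flag.ContinueOnError)
	lang := fs.String("ocr-language", "", "Tesseract language code to pass to Tika, e.g. ces")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *lang, nil
}

func main() {
	lang, err := parseOCRLanguage(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if lang == "" {
		fmt.Println("no OCR language set; no hint header would be sent")
		return
	}
	fmt.Println("would send OCR language hint:", lang)
}
```

With something like this, running the processor with -ocr-language ces would make the value available at the point where the Tika request is built.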