Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Tesseract always gets started with "-l eng" #99

Closed: jetnet closed this issue 5 years ago

jetnet commented 5 years ago

Hello Pascal,

I've got the following crawler config:

        <importer>
                <preParseHandlers>
                        #parse("../shared/preParseHandlers.xml")
                </preParseHandlers>
                <documentParserFactory>
                        #parse("../shared/documentParserFactory.xml")
                </documentParserFactory>
                <postParseHandlers>
                        #parse("../shared/postParseHandlers.xml")
                </postParseHandlers>
        </importer>

and its ../shared/documentParserFactory.xml:

<ocr path="/usr/bin">
        <languages>eng,deu</languages>
        <contentTypes>image/jpeg,image/png,image/gif</contentTypes>
</ocr>

Two languages are specified, but for some reason Tesseract always gets started like this:

tesseract /tmp/apache-tika-5700897182878023572.tmp /tmp/apache-tika-3728352888399241841.tmp -l eng -psm 1 txt -c preserve_interword_spaces=0

Tesseract was installed via apt-get on Ubuntu:

tesseract --list-langs
List of available languages (4):
deu
osd
eng
equ

I double-double checked everything, but could not find what is wrong. I even tried setting the languages like this: <languages>deu+eng</languages>, but nothing changed. Any ideas? Thank you!

jetnet commented 5 years ago

I apologize, sometimes a "double-double" check is not enough. <contentTypes> must be a regex, not a comma-separated list.
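
For anyone landing here with the same symptom, here is a small sketch of why the comma-separated value never matched. It uses Python's `re` module purely for illustration (the Importer itself is Java, and its exact matching mode is not shown in this thread), and the alternation pattern `image/(jpeg|png|gif)` is my assumption of what a working value could look like, not something confirmed above. The helper name `ocr_applies` is hypothetical.

```python
import re

# Value from the original config, interpreted as ONE regular expression.
LIST_STYLE = "image/jpeg,image/png,image/gif"

# Assumed corrected value: alternation instead of a comma-separated list.
REGEX_STYLE = "image/(jpeg|png|gif)"


def ocr_applies(pattern: str, content_type: str) -> bool:
    """Hypothetical helper: True if the whole content type matches the pattern."""
    return re.fullmatch(pattern, content_type) is not None


if __name__ == "__main__":
    for ct in ("image/jpeg", "image/png", "image/gif", "application/pdf"):
        # Print how each pattern treats this content type.
        print(ct, ocr_applies(LIST_STYLE, ct), ocr_applies(REGEX_STYLE, ct))
```

Read as a regex, the list-style value only matches the literal string "image/jpeg,image/png,image/gif", so no real content type can satisfy it, whether the match is full or partial. That would be consistent with the <ocr> settings, including <languages>, not being applied to the images.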