Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
Hello Pascal,
I've got a crawler config and its ../shared/documentParserFactory.xml. Two languages are specified, but for some reason Tesseract always gets started like this:
tesseract /tmp/apache-tika-5700897182878023572.tmp /tmp/apache-tika-3728352888399241841.tmp -l eng -psm 1 txt -c preserve_interword_spaces=0
Tesseract was installed via apt-get on Ubuntu:
tesseract --list-langs
List of available languages (4):
deu
osd
eng
equ
I double-checked everything, but could not find what is wrong.
I even tried setting the languages like this: <languages>deu+eng</languages>, but nothing changed.
Any ideas?
Thank you!
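One way to rule out the Tesseract install itself is to compare the requested languages against what `tesseract --list-langs` reports. The following is a minimal Python sketch, not part of Norconex Importer or Tika; the helper names (`parse_list_langs`, `missing_languages`, `installed_languages`) are hypothetical, and it assumes the `tesseract` binary is on the PATH. If nothing is reported missing, the `-l eng` in the command line would suggest that the configured languages are not reaching the OCR call, rather than a problem with the installed language packs.

```python
import subprocess


def parse_list_langs(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages (N):" header.
    return [
        line for line in lines
        if line and not line.lower().startswith("list of available")
    ]


def missing_languages(requested, available):
    """Return requested codes (e.g. "deu+eng" or "deu, eng") not installed."""
    codes = [c.strip() for c in requested.replace(",", "+").split("+")]
    return [c for c in codes if c and c not in available]


def installed_languages(tesseract="tesseract"):
    """Query the real binary (assumption: it is on the PATH)."""
    result = subprocess.run(
        [tesseract, "--list-langs"], capture_output=True, text=True
    )
    # Some Tesseract versions write the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


# Check against the output quoted in the post, without running Tesseract.
sample = """List of available languages (4):
deu
osd
eng
equ"""
print(missing_languages("deu+eng", parse_list_langs(sample)))  # prints []
```

To check a live system instead of the quoted sample, call `missing_languages("deu+eng", installed_languages())`.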