dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Tesseract not detected on RHEL Linux #1060

Closed by tpanza 3 years ago

tpanza commented 3 years ago

Describe the bug

fscrawler reports that tesseract is not found, even though I set the OCR path and data_path in my _settings.yaml file to the correct locations.

Tesseract was built from source and installed on the system as the non-root user. The install itself seems fine. Here is the output of a quick check running it with --version:

tesseract 4.1.1
 leptonica-1.74.3
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Job Settings

---
name: "test1"
fs:
  url: "/mnt/my-fileshare1"
  update_rate: "10m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    path: "/home/azureuser/local/bin/tesseract"
    data_path: "/home/azureuser/local/share/tessdata"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Logs

First, build leptonica and tesseract from source ...

sudo yum --noplugins --enablerepo=* install autoconf automake libtool autoconf-archive pkgconfig gcc gcc-c++ make libjpeg-devel libtiff-devel zlib-devel cairo-devel pango-devel icu-devel

# download tesseract-4.1.1.tar.gz
# download leptonica-1.74.3.tar.gz

tar -xzf leptonica-1.74.3.tar.gz
cd leptonica-1.74.3/
autoreconf -vif
./configure --prefix=$HOME/local/
make install
cd ..

tar -xzf tesseract-4.1.1.tar.gz
cd tesseract-4.1.1/
./autogen.sh
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig
LIBLEPT_HEADERSDIR=$HOME/local/include ./configure --prefix=$HOME/local/ --with-extra-libraries=$HOME/local/lib
make
make install
cd ..
export TESSDATA_PREFIX=$HOME/local/share/tessdata

cd ~/elasticstack/fscrawler-es7-2.7-SNAPSHOT
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig
export TESSDATA_PREFIX=$HOME/local/share/tessdata
FS_JAVA_OPTS="-DLOG_DIR=logs -DLOG_LEVEL=trace -DDOC_LEVEL=debug" bin/fscrawler test1 --restart --debug
01:05:10,426 [INFO ] [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [59.4mb/839.5mb=7.08%], RAM [556.1mb/3.6gb=14.75%], Swap [3.6gb/3.8gb=92.74%].
01:05:10,437 [DEBUG] [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
01:05:10,437 [DEBUG] [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
01:05:10,437 [DEBUG] [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
01:05:10,437 [DEBUG] [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
01:05:10,444 [DEBUG] [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [test1]...
01:05:10,446 [DEBUG] [f.p.e.c.f.c.FsCrawlerCli] Starting job [test1]...
01:05:11,577 [DEBUG] [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
01:05:13,623 [INFO ] [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.6.1
01:05:13,903 [INFO ] [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
01:05:13,903 [INFO ] [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
01:05:13,912 [DEBUG] [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.6.1] node.
01:05:13,912 [DEBUG] [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [test1]
01:05:14,513 [DEBUG] [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test1]
01:05:14,571 [DEBUG] [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [test1_folder]
01:05:14,592 [DEBUG] [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test1_folder]
01:05:14,618 [DEBUG] [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test1] for [/mnt/my-fileshare1] every [10m]
01:05:14,620 [INFO ] [f.p.e.c.f.FsParserAbstract] FS crawler started for [test1] for [/mnt/my-fileshare1] every [10m]
01:05:14,620 [DEBUG] [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test1] is now running. Run #1...
01:05:14,657 [DEBUG] [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1) = /
01:05:14,667 [DEBUG] [f.p.e.c.f.FsParserAbstract] Indexing test1_folder/312a2c39d9829b7137e7bada663882?pipeline=null
01:05:14,675 [DEBUG] [f.p.e.c.f.FsParserAbstract] indexing [/mnt/my-fileshare1] content
01:05:14,675 [DEBUG] [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /mnt/my-fileshare1
01:05:14,792 [DEBUG] [f.p.e.c.f.c.f.FileAbstractorFile] 84 local files found
...
01:05:14,898 [DEBUG] [f.p.e.c.f.t.TikaInstance] OCR is activated.
01:05:14,916 [DEBUG] [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
01:05:15,779 [WARN ] [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

01:05:16,846 [DEBUG] [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
01:05:16,847 [DEBUG] [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/azureuser/local/bin/tesseract].
01:05:16,847 [DEBUG] [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/azureuser/local/share/tessdata].
01:05:16,847 [DEBUG] [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
01:05:18,322 [DEBUG] [f.p.e.c.f.FsParserAbstract] Indexing test1/55da60713adfb0b9a4e870f05168f36f?pipeline=null

Expected behavior

Expected tesseract to be found and used by fscrawler

tpanza commented 3 years ago

I found that if I add $HOME/local/bin (the directory where tesseract was installed after building it from source) to my PATH, I get a little further, but then hit a different error:

(TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly)

01:23:55,548 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
01:23:55,576 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
01:23:56,523 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

01:23:57,283 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
01:23:57,283 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/azureuser/local/bin/tesseract].
01:23:57,284 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/azureuser/local/share/tessdata].
01:23:57,284 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
...
01:24:04,448 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/mnt/my-fileshare1],[my-pdf-filename.pdf]
01:24:04,448 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/my-pdf-filename.pdf) = /my-pdf-filename.pdf
01:24:05,420 INFO  [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
01:24:05,448 INFO  [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
01:24:05,477 INFO  [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
01:24:05,704 INFO  [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
01:24:05,910 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/my-pdf-filename.pdf]  -> Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
01:24:05,911 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/my-pdf-filename.pdf]
org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:146) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.22.jar:1.22]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.22.jar:1.22]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:474) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:446) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) ~[pdfbox-2.0.16.jar:2.0.16]
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.16.jar:2.0.16]
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124) ~[tika-parsers-1.22.jar:1.22]
        ... 9 more
Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:431) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) ~[pdfbox-2.0.16.jar:2.0.16]
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835) ~[tika-parsers-1.22.jar:1.22]
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.16.jar:2.0.16]
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124) ~[tika-parsers-1.22.jar:1.22]
        ... 9 more

dadoonet commented 3 years ago

Is it possible to share your /mnt/my-fileshare1/my-pdf-filename.pdf file? I'd like to test it locally. Note that I probably won't be able to test it today.

tpanza commented 3 years ago

Reproduced it with a newer snapshot of fscrawler 2.7 (fscrawler-es7-2.7-20201209.172419-147) and this PDF document (p8-product-card-2020.pdf).

From logs/documents.log:

2020-12-14 17:09:31,322 [ERROR] [826a3cf636e16c11fca3347cc51d1539][/p8-product-card-2020.pdf] Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly

Snippet from grepping the logs/fscrawler.log file:

logs/fscrawler.log-807-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-808-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/p8-product-card-2020.pdf], includes = [null], excludes = [[*/~*]]
logs/fscrawler.log-809-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/p8-product-card-2020.pdf], excludes = [[*/~*]]
logs/fscrawler.log-810-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/p8-product-card-2020.pdf], includes = [null]
logs/fscrawler.log-811-17:09:30,676 DEBUG [f.p.e.c.f.FsParserAbstract] [/p8-product-card-2020.pdf] can be indexed: [true]
logs/fscrawler.log-812-17:09:30,676 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /p8-product-card-2020.pdf
logs/fscrawler.log-813-17:09:30,749 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/mnt/my-fileshare1],[p8-product-card-2020.pdf]
logs/fscrawler.log-814-17:09:30,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-815-17:09:31,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log:816:17:09:31,322 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/p8-product-card-2020.pdf]: Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
logs/fscrawler.log-817-17:09:31,322 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/p8-product-card-2020.pdf]
logs/fscrawler.log-818-org.apache.tika.exception.TikaException: Unable to extract PDF content
logs/fscrawler.log-819- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:118) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-820- at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:173) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-821- at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.25.jar:1.25]
logs/fscrawler.log-822- at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.25.jar:1.25]
logs/fscrawler.log-823- at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-824- at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:96) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-825- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:485) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-826- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:278) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-827- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-828- at java.lang.Thread.run(Thread.java:832) [?:?]
logs/fscrawler.log-829-Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
logs/fscrawler.log-830- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:554) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-831- at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-832- at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-833- at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-834- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-835- at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-836- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-837- ... 9 more
logs/fscrawler.log:838:Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
logs/fscrawler.log-839- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:437) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-840- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:539) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-841- at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-842- at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-843- at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-844- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-845- at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-846- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-847- ... 9 more
logs/fscrawler.log-848-17:09:31,328 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-849-17:09:31,328 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test1/826a3cf636e16c11fca3347cc51d1539?pipeline=null

Also, when fscrawler encounters JPG image files, it no longer throws errors or warnings. However, no content field is stored for any of these JPEG images, so we cannot search for text inside them. Something appears to be failing silently (or simply not being done), and the end result is that image files are searchable only by metadata, not by content.

dadoonet commented 3 years ago

I was able to extract the following content locally from your PDF document:

P-8
Proven Multimission, Multidomain Capability

322010_P8_Product Card_0620.indd   1 7/6/2020   8:05:48 AM

Pe ey GM aces ria Witte lute

Ps Bs
s r

322010_P8_Product Card_0620.indd 1

United States Contact: Perry Yaw
Telephone: 253-657-0842
Email: perry.d.yaw@boeing.com

International Contact: Tim Flood
Telephone: 301-247-8939
Email: timothy.k.flood@boeing.com

www.boeing.com/P-8/ 
Copyright © 2020 Boeing. All rights reserved.
322010 | 7/20

PROVEN     •     AFFORDABLE     •     RELIABLE     •     OVERSEA     •     OVERLAND     •     OVERALL

Technical Specifications
Wingspan 123.6 ft 37.64 m

Height 42.1 ft 12.83 m

Length 129.6 ft 39.50 m

Propulsion 
thrust, each

CFM-56-7BE (2) 
27,300 lb

121.44 kN

Max speed 490 kt  
564 mi/h

459+ ktas
907 km/h

Ceiling 41,000 ft 12,496 m

MGTW 189,200 lb 85,820 kg

Range 1,200+ nm with >4 hr time on 
station, 2,225+ km

Weapon stores 
compatibility

129 A-size sonobuoys
Harpoon
MK-54
Survival kit

Multimission Maritime 
Aircraft

With surveillance and reconnaissance, 
search and rescue, and long-range 
anti-submarine capabilities, the P-8 
is the most capable multimission 
aircraft deployed around the world, 
protecting seas and securing borders. 
A proven system with more than 100 
aircraft in service and over 300,000 
flight hours. The P-8’s performance 
and reliability delivers confidence in an 
uncertain world — in any condition, 
anywhere, anytime.

Innovative Solutions

The P-8 combines the most advanced 
weapon system in the world with 
the cost advantages of the most 
popular airliner on the planet. The 
P-8 shares 86% commonality with 
the commercial 737NG, providing 
enormous supply chain economies 
of scale in production and support. 
Boeing’s expertise in commercial fleet 
management and derivative aircraft 
sustainment provides customers 
with greater availability at a lower 
operational cost. 

Integrated Support

The P-8 provides high levels of fleet 
availability while reducing life-cycle 
cost. Our One Boeing solutions are 
leveraged for training, sustainment 
and service support. This included 
supply chain, global support, field 
service, data/tech publications, 
spares and repairs, modifications 
and retrofits.

Anti-Submarine Warfare

The P-8 executes anti-submarine 
warfare (ASW) through an integrated 
sensor suite to conduct search, 
detection, classification, localization, 
tracking and attack of submarines. 
The P-8 utilizes a state-of-the-art 
acoustics sensor suite, sonobuoys, 
electronic support measures (ESM), 
inverse synthetic aperture radar (ISAR) 
and the delivery of torpedoes for 
sub-hunting.

Anti-Surface Warfare

The P-8 executes antisurface warfare 
(ASuW) through elegant communications 
and data link systems. This integrated 
sensor suite conducts search, detection, 
classification, localization, tracking and 
attack of naval surface targets, utilizing 
ESM and intelligence, surveillance and 
reconnaissance (ISR) and delivering of 
Harpoon missiles.

Maritime ISR

The P-8 accomplishes maritime ISR 
through a proven sensor suite of radar 
with ISAR, synthetic aperture radar 
(SAR), periscope, search and navigation. 

These systems are optimized for combat 
ready maritime patrol in detecting, locating 
and tracking surface and undersea targets. 

Overland ISR

The P-8 has the proven capability to 
effectively conduct overland ISR and 
battle space control (C2) of land forces 
using its advanced mission system, 
data link and electro-optical/infrared 
sensor suite. 

Communications 

The P-8 has a full complement of radio 
frequency communications via Link 11 
and 16 to support coordination of 
operations. This includes wideband 
satellite communications with ground 
stations that enable interoperability 
with allies and partner nations. 

Search and Rescue

With its advanced sensors and long 
endurance, further enhanced by in-flight 
refueling, the P-8 can search and deliver 
rescue stores in large ocean and overland 
areas quickly at high and low altitudes. 
This includes the carriage and release 
of the UNI-PAC search and rescue 
survival kit. 

322010_P8_Product Card_0620.indd   2 7/6/2020   8:05:49 AM

PROVEN

Multimission Maritime
Aircraft

With surveillance and reconnaissance,
search and rescue, and long-range
anti-submarine capabilities, the P-8
is the most capable multimission
aircraft deployed around the world,
protecting seas and securing borders.
A proven system with more than 100
© aircraft in service and over 300,000
flight hours. The P-8’s performance
and reliability delivers confidence in an
uncertain world — in any condition,
anywhere, anytime.

Technical Specifications

Wingspan aPxCHOnit 37.64 m
Height Ci 12.83 m
Length 129.6 ft 39.50 m

CFM-56-7BE (2) 121.44 kN
27,300 lb

Cre at
564 mi/h

Propulsion
thrust, each

459+ ktas
iC] OYA nny)
41,000 ft 12,496 m
189,200 lb 85,820 kg
1,200+ nm with >4 hr time on
station, 2,225+ km

129 A-size sonobuoys
Imre elele)a

Mk-54

Survival kit

Max speed

Ceiling
MGTW
Range

Weapon stores
compatibility

www.boeing.com/P-8/

Copyright © 2020 Boeing. All rights reserved.
322010 | 7/20

| | 322010_P8_Product Card_0620.indd 2

AFFORDABLE

RELIABLE

Anti-Submarine Warfare

The P-8 executes anti-submarine
warfare (ASW) through an integrated
sensor suite to conduct search,
detection, classification, localization,
tracking and attack of submarines.
The P-8 utilizes a state-of-the-art
acoustics sensor suite, Sonobuoys,
electronic support measures (ESM),
inverse synthetic aperture radar (ISAR)
and the delivery of torpedoes for
sub-hunting.

Anti-Surface Warfare

The P-8 executes antisurface warfare
(ASuW) through elegant communications
and data link systems. This integrated
sensor suite conducts search, detection,
Classification, localization, tracking and
attack of naval surface targets, utilizing
ESM and intelligence, surveillance and
reconnaissance (ISR) and delivering of
Harpoon missiles.

Maritime ISR

The P-8 accomplishes maritime ISR
through a proven sensor suite of radar
with ISAR, synthetic aperture radar
(SAR), periscope, search and navigation.

OVERSEA

These systems are optimized for combat
ready maritime patrol in detecting, locating

and tracking surface and undersea targets.

Overland ISR

The P-8 has the proven capability to
effectively conduct overland ISR and
battle space control (C2) of land forces
using its advanced mission system,
data link and electro-optical/infrared
sensor suite.

Communications

The P-8 has a full complement of radio
frequency communications via Link 11
and 16 to support coordination of
operations. This includes wideband
satellite communications with ground
stations that enable interoperability
with allies and partner nations.

Search and Rescue

With its advanced sensors and long
endurance, further enhanced by in-flight
refueling, the P-8 can search and deliver
rescue stores in large ocean and overland
areas quickly at high and low altitudes.
This includes the carriage and release

of the UNI-PAC search and rescue
survival kit.

e OVERLAND e OVERALL

—

Innovative Solutions

The P-8 combines the most advanced
weapon system in the world with

the cost advantages of the most
popular airliner on the planet. The

P-8 shares 86% commonality with

the commercial 737NG, providing
enormous supply chain economies

of scale in production and support.
Boeing’s expertise in commercial fleet
management and derivative aircraft J
sustainment provides customers

with greater availability at a lower
operational cost.

Integrated Support

The P-8 provides high levels of fleet
availability while reducing life-cycle
cost. Our One Boeing solutions are
leveraged for training, sustainment
and service support. This included
supply chain, global support, field
service, data/tech publications,
spares and repairs, modifications
and retrofits.

United States Contact: Perry Yaw
Telephone: 253-657-0842
Email: perry.d.yaw@boeing.com

International Contact: Tim Flood
Telephone: 301-247-8939
Email: timothy.k.flood@boeing.com

7/6/2020 8:05:49 AM | |

I need to check a bit more why it does not work on your side. Stay tuned.

EDIT: this is the version I'm using locally:

> ~ $ tesseract --version
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

dadoonet commented 3 years ago

I can reproduce the problem:

11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/usr/local/Cellar/tesseract/4.1.1/bin/tesseract].
11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/usr/local/Cellar/tesseract/4.1.1/share/tessdata].
11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
11:39:36,157 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(../docs/1060, ../docs/1060/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
11:39:36,157 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [../docs/1060/p8-product-card-2020.pdf]: Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
11:39:36,159 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [../docs/1060/p8-product-card-2020.pdf]
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:118) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:173) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.25.jar:1.25]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.25.jar:1.25]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:96) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) [fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:281) [fscrawler-core-2.7-SNAPSHOT.jar:?]
    at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.7-SNAPSHOT.jar:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:554) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
    ... 9 more
Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:437) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:539) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
    ... 9 more

Let me see now what is happening 😬

dadoonet commented 3 years ago

I found the bug. It's actually a documentation issue.

If you set the tesseract path to the directory that contains the binary, /home/azureuser/local/bin/, instead of the binary itself, /home/azureuser/local/bin/tesseract, it should work.
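
Applied to the job settings posted at the top of this issue, the change would look roughly like the fragment below. This is a sketch that assumes only `fs.ocr.path` changes and every other value stays as originally posted; the rest of `_settings.yaml` is omitted.

```yaml
fs:
  ocr:
    # Assumption based on the fix described in this thread: `path` points at
    # the directory that contains the tesseract binary, not the binary itself.
    path: "/home/azureuser/local/bin/"
    data_path: "/home/azureuser/local/share/tessdata"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
```

This appears consistent with Apache Tika's `TesseractOCRConfig`, which treats its Tesseract path as the directory holding the executable. Given the full binary path, the lookup presumably ends up at `/home/azureuser/local/bin/tesseract/tesseract`, which does not exist.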

tpanza commented 3 years ago

@dadoonet thank you! Confirmed that this now works