Closed tpanza closed 3 years ago
Found that if I add $HOME/local/bin (the dir to which tesseract was installed after building it from source) to my PATH, then I get a little further, but hit a different error (TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly):
01:23:55,548 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
01:23:55,576 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
01:23:56,523 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
01:23:57,283 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
01:23:57,283 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/azureuser/local/bin/tesseract].
01:23:57,284 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/azureuser/local/share/tessdata].
01:23:57,284 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
...
01:24:04,448 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/mnt/my-fileshare1],[my-pdf-filename.pdf]
01:24:04,448 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/my-pdf-filename.pdf) = /my-pdf-filename.pdf
01:24:05,420 INFO [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F1 are not implemented in PDFBox and will be ignored
01:24:05,448 INFO [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F2 are not implemented in PDFBox and will be ignored
01:24:05,477 INFO [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F3 are not implemented in PDFBox and will be ignored
01:24:05,704 INFO [o.a.p.p.f.PDCIDFontType2] OpenType Layout tables used in font CIDFont+F4 are not implemented in PDFBox and will be ignored
01:24:05,910 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/my-pdf-filename.pdf] -> Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
01:24:05,911 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/my-pdf-filename.pdf]
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:146) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.22.jar:1.22]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.22.jar:1.22]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:474) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:446) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169) ~[tika-parsers-1.22.jar:1.22]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) ~[pdfbox-2.0.16.jar:2.0.16]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835) ~[tika-parsers-1.22.jar:1.22]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.16.jar:2.0.16]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124) ~[tika-parsers-1.22.jar:1.22]
... 9 more
Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:431) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:169) ~[tika-parsers-1.22.jar:1.22]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393) ~[pdfbox-2.0.16.jar:2.0.16]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153) ~[tika-parsers-1.22.jar:1.22]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:835) ~[tika-parsers-1.22.jar:1.22]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[pdfbox-2.0.16.jar:2.0.16]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124) ~[tika-parsers-1.22.jar:1.22]
... 9 more
Is it possible to share your /mnt/my-fileshare1/my-pdf-filename.pdf file? I'd like to test it locally. Note that I probably won't be able to test it today.
Reproduced it with this (newer, fscrawler-es7-2.7-20201209.172419-147) snapshot of fscrawler 2.7 and this PDF document.
From logs/documents.log:
2020-12-14 17:09:31,322 [ERROR] [826a3cf636e16c11fca3347cc51d1539][/p8-product-card-2020.pdf] Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
Snippet from grepping the logs/fscrawler.log file:
logs/fscrawler.log-807-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-808-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/p8-product-card-2020.pdf], includes = [null], excludes = [[*/~*]]
logs/fscrawler.log-809-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/p8-product-card-2020.pdf], excludes = [[*/~*]]
logs/fscrawler.log-810-17:09:30,676 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/p8-product-card-2020.pdf], includes = [null]
logs/fscrawler.log-811-17:09:30,676 DEBUG [f.p.e.c.f.FsParserAbstract] [/p8-product-card-2020.pdf] can be indexed: [true]
logs/fscrawler.log-812-17:09:30,676 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /p8-product-card-2020.pdf
logs/fscrawler.log-813-17:09:30,749 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/mnt/my-fileshare1],[p8-product-card-2020.pdf]
logs/fscrawler.log-814-17:09:30,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-815-17:09:31,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log:816:17:09:31,322 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/p8-product-card-2020.pdf]: Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
logs/fscrawler.log-817-17:09:31,322 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/mnt/my-fileshare1/p8-product-card-2020.pdf]
logs/fscrawler.log-818-org.apache.tika.exception.TikaException: Unable to extract PDF content
logs/fscrawler.log-819- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:118) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-820- at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:173) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-821- at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.25.jar:1.25]
logs/fscrawler.log-822- at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.25.jar:1.25]
logs/fscrawler.log-823- at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-824- at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:96) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-825- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:485) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-826- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:278) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-827- at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.7-SNAPSHOT.jar:?]
logs/fscrawler.log-828- at java.lang.Thread.run(Thread.java:832) [?:?]
logs/fscrawler.log-829-Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
logs/fscrawler.log-830- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:554) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-831- at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-832- at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-833- at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-834- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-835- at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-836- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-837- ... 9 more
logs/fscrawler.log:838:Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
logs/fscrawler.log-839- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:437) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-840- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:539) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-841- at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-842- at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-843- at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-844- at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-845- at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
logs/fscrawler.log-846- at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
logs/fscrawler.log-847- ... 9 more
logs/fscrawler.log-848-17:09:31,328 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/mnt/my-fileshare1, /mnt/my-fileshare1/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
logs/fscrawler.log-849-17:09:31,328 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test1/826a3cf636e16c11fca3347cc51d1539?pipeline=null
Also, when fscrawler encounters JPG image files, it no longer throws errors or warnings. However, no content field is stored for any of these JPEG images, so we cannot search for any content within them. Something appears to be failing silently (or simply not being done), with the end result that image files are only searchable by metadata and not by content.
I was able to extract locally the following content from your PDF document:
P-8
Proven Multimission, Multidomain Capability
322010_P8_Product Card_0620.indd 1 7/6/2020 8:05:48 AM
Pe ey GM aces ria Witte lute
Ps Bs
s r
322010_P8_Product Card_0620.indd 1
United States Contact: Perry Yaw
Telephone: 253-657-0842
Email: perry.d.yaw@boeing.com
International Contact: Tim Flood
Telephone: 301-247-8939
Email: timothy.k.flood@boeing.com
www.boeing.com/P-8/
Copyright © 2020 Boeing. All rights reserved.
322010 | 7/20
PROVEN • AFFORDABLE • RELIABLE • OVERSEA • OVERLAND • OVERALL
Technical Specifications
Wingspan 123.6 ft 37.64 m
Height 42.1 ft 12.83 m
Length 129.6 ft 39.50 m
Propulsion
thrust, each
CFM-56-7BE (2)
27,300 lb
121.44 kN
Max speed 490 kt
564 mi/h
459+ ktas
907 km/h
Ceiling 41,000 ft 12,496 m
MGTW 189,200 lb 85,820 kg
Range 1,200+ nm with >4 hr time on
station, 2,225+ km
Weapon stores
compatibility
129 A-size sonobuoys
Harpoon
MK-54
Survival kit
Multimission Maritime
Aircraft
With surveillance and reconnaissance,
search and rescue, and long-range
anti-submarine capabilities, the P-8
is the most capable multimission
aircraft deployed around the world,
protecting seas and securing borders.
A proven system with more than 100
aircraft in service and over 300,000
flight hours. The P-8’s performance
and reliability delivers confidence in an
uncertain world — in any condition,
anywhere, anytime.
Innovative Solutions
The P-8 combines the most advanced
weapon system in the world with
the cost advantages of the most
popular airliner on the planet. The
P-8 shares 86% commonality with
the commercial 737NG, providing
enormous supply chain economies
of scale in production and support.
Boeing’s expertise in commercial fleet
management and derivative aircraft
sustainment provides customers
with greater availability at a lower
operational cost.
Integrated Support
The P-8 provides high levels of fleet
availability while reducing life-cycle
cost. Our One Boeing solutions are
leveraged for training, sustainment
and service support. This included
supply chain, global support, field
service, data/tech publications,
spares and repairs, modifications
and retrofits.
Anti-Submarine Warfare
The P-8 executes anti-submarine
warfare (ASW) through an integrated
sensor suite to conduct search,
detection, classification, localization,
tracking and attack of submarines.
The P-8 utilizes a state-of-the-art
acoustics sensor suite, sonobuoys,
electronic support measures (ESM),
inverse synthetic aperture radar (ISAR)
and the delivery of torpedoes for
sub-hunting.
Anti-Surface Warfare
The P-8 executes antisurface warfare
(ASuW) through elegant communications
and data link systems. This integrated
sensor suite conducts search, detection,
classification, localization, tracking and
attack of naval surface targets, utilizing
ESM and intelligence, surveillance and
reconnaissance (ISR) and delivering of
Harpoon missiles.
Maritime ISR
The P-8 accomplishes maritime ISR
through a proven sensor suite of radar
with ISAR, synthetic aperture radar
(SAR), periscope, search and navigation.
These systems are optimized for combat
ready maritime patrol in detecting, locating
and tracking surface and undersea targets.
Overland ISR
The P-8 has the proven capability to
effectively conduct overland ISR and
battle space control (C2) of land forces
using its advanced mission system,
data link and electro-optical/infrared
sensor suite.
Communications
The P-8 has a full complement of radio
frequency communications via Link 11
and 16 to support coordination of
operations. This includes wideband
satellite communications with ground
stations that enable interoperability
with allies and partner nations.
Search and Rescue
With its advanced sensors and long
endurance, further enhanced by in-flight
refueling, the P-8 can search and deliver
rescue stores in large ocean and overland
areas quickly at high and low altitudes.
This includes the carriage and release
of the UNI-PAC search and rescue
survival kit.
322010_P8_Product Card_0620.indd 2 7/6/2020 8:05:49 AM
PROVEN
Multimission Maritime
Aircraft
With surveillance and reconnaissance,
search and rescue, and long-range
anti-submarine capabilities, the P-8
is the most capable multimission
aircraft deployed around the world,
protecting seas and securing borders.
A proven system with more than 100
© aircraft in service and over 300,000
flight hours. The P-8’s performance
and reliability delivers confidence in an
uncertain world — in any condition,
anywhere, anytime.
Technical Specifications
Wingspan aPxCHOnit 37.64 m
Height Ci 12.83 m
Length 129.6 ft 39.50 m
CFM-56-7BE (2) 121.44 kN
27,300 lb
Cre at
564 mi/h
Propulsion
thrust, each
459+ ktas
iC] OYA nny)
41,000 ft 12,496 m
189,200 lb 85,820 kg
1,200+ nm with >4 hr time on
station, 2,225+ km
129 A-size sonobuoys
Imre elele)a
Mk-54
Survival kit
Max speed
Ceiling
MGTW
Range
Weapon stores
compatibility
www.boeing.com/P-8/
Copyright © 2020 Boeing. All rights reserved.
322010 | 7/20
| | 322010_P8_Product Card_0620.indd 2
AFFORDABLE
RELIABLE
Anti-Submarine Warfare
The P-8 executes anti-submarine
warfare (ASW) through an integrated
sensor suite to conduct search,
detection, classification, localization,
tracking and attack of submarines.
The P-8 utilizes a state-of-the-art
acoustics sensor suite, Sonobuoys,
electronic support measures (ESM),
inverse synthetic aperture radar (ISAR)
and the delivery of torpedoes for
sub-hunting.
Anti-Surface Warfare
The P-8 executes antisurface warfare
(ASuW) through elegant communications
and data link systems. This integrated
sensor suite conducts search, detection,
Classification, localization, tracking and
attack of naval surface targets, utilizing
ESM and intelligence, surveillance and
reconnaissance (ISR) and delivering of
Harpoon missiles.
Maritime ISR
The P-8 accomplishes maritime ISR
through a proven sensor suite of radar
with ISAR, synthetic aperture radar
(SAR), periscope, search and navigation.
OVERSEA
These systems are optimized for combat
ready maritime patrol in detecting, locating
and tracking surface and undersea targets.
Overland ISR
The P-8 has the proven capability to
effectively conduct overland ISR and
battle space control (C2) of land forces
using its advanced mission system,
data link and electro-optical/infrared
sensor suite.
Communications
The P-8 has a full complement of radio
frequency communications via Link 11
and 16 to support coordination of
operations. This includes wideband
satellite communications with ground
stations that enable interoperability
with allies and partner nations.
Search and Rescue
With its advanced sensors and long
endurance, further enhanced by in-flight
refueling, the P-8 can search and deliver
rescue stores in large ocean and overland
areas quickly at high and low altitudes.
This includes the carriage and release
of the UNI-PAC search and rescue
survival kit.
e OVERLAND e OVERALL
—
Innovative Solutions
The P-8 combines the most advanced
weapon system in the world with
the cost advantages of the most
popular airliner on the planet. The
P-8 shares 86% commonality with
the commercial 737NG, providing
enormous supply chain economies
of scale in production and support.
Boeing’s expertise in commercial fleet
management and derivative aircraft J
sustainment provides customers
with greater availability at a lower
operational cost.
Integrated Support
The P-8 provides high levels of fleet
availability while reducing life-cycle
cost. Our One Boeing solutions are
leveraged for training, sustainment
and service support. This included
supply chain, global support, field
service, data/tech publications,
spares and repairs, modifications
and retrofits.
United States Contact: Perry Yaw
Telephone: 253-657-0842
Email: perry.d.yaw@boeing.com
International Contact: Tim Flood
Telephone: 301-247-8939
Email: timothy.k.flood@boeing.com
7/6/2020 8:05:49 AM | |
I need to check a bit more why it does not work on your side. Stay tuned.
EDIT: this is the version I'm using locally:
> ~ $ tesseract --version
tesseract 4.1.1
leptonica-1.80.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
I can reproduce the problem:
11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/usr/local/Cellar/tesseract/4.1.1/bin/tesseract].
11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/usr/local/Cellar/tesseract/4.1.1/share/tessdata].
11:39:35,903 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
11:39:36,157 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(../docs/1060, ../docs/1060/p8-product-card-2020.pdf) = /p8-product-card-2020.pdf
11:39:36,157 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [../docs/1060/p8-product-card-2020.pdf]: Unable to extract PDF content -> Unable to end a page -> Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
11:39:36,159 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [../docs/1060/p8-product-card-2020.pdf]
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:118) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:173) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.25.jar:1.25]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.25.jar:1.25]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:96) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:439) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:281) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:554) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
... 9 more
Caused by: org.apache.tika.exception.TikaException: Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:437) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:539) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:141) ~[tika-parsers-1.25.jar:1.25]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:399) ~[pdfbox-2.0.21.jar:2.0.21]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:125) ~[tika-parsers-1.25.jar:1.25]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:967) ~[tika-parsers-1.25.jar:1.25]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272) ~[pdfbox-2.0.21.jar:2.0.21]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96) ~[tika-parsers-1.25.jar:1.25]
... 9 more
Let me see now what is happening 😬
I found the bug. It's actually a documentation issue.
If you set the tesseract path to the directory /home/azureuser/local/bin/ instead of the executable /home/azureuser/local/bin/tesseract, it should work.
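For reference, a rough sketch of how the OCR part of the job's _settings.yaml might look after this change. The key names under fs.ocr (language, path, data_path, pdf_strategy) are my reading of the FSCrawler OCR settings, and the job name is only a guess from the logs above, so treat all of them as assumptions and check them against your own file. The only point is that path names the directory holding the binary, not the binary itself (presumably the executable name gets appended to it later on):

```yaml
# Illustrative sketch only: job name and key layout are assumptions.
name: "test1"
fs:
  ocr:
    language: "eng"
    # Directory that contains the tesseract executable, NOT the executable itself
    path: "/home/azureuser/local/bin/"
    data_path: "/home/azureuser/local/share/tessdata"
    pdf_strategy: "ocr_and_text"
```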
@dadoonet thank you! Confirmed that this now works
Describe the bug
fscrawler claims tesseract is not found, despite setting the OCR path and data_path in my _settings.yaml file to correct locations.
Tesseract was installed on the system as the non-root user by building from source. The tesseract install seems fine. Quick check of running --version on it:
Job Settings
Logs
First, build leptonica and tesseract from source ...
Expected behavior
Expected tesseract to be found and used by fscrawler
Versions:
Attachment
If the bug is related to a given file, please share this file so we can reuse it to reproduce the problem and maybe use it in our integration tests.