Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
184 stars, 67 forks

Could not read embedded TTF for font NXRJXX+Arial-ItalicMT error when parsing PDF document #759

Closed by onixterry 3 years ago

onixterry commented 3 years ago

I have a collector set up to crawl an Intranet site with many PDFs. There are many cases of errors reading embedded fonts in the PDFs.

WARN  - PDTrueTypeFont             - Could not read embedded TTF for font NXRJXX+Arial-ItalicMT
java.io.IOException: Unexpected end of TTF stream reached
        at org.apache.fontbox.ttf.TTFDataStream.read(TTFDataStream.java:274)
        at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:91)
        at org.apache.fontbox.ttf.NamingTable.read(NamingTable.java:113)
        at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
        at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
        at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
        at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
        at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:421)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:152)
        at com.norconex.importer.Importer.parseDocument(Importer.java:415)
        at com.norconex.importer.Importer.importDocument(Importer.java:313)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:361)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Is there anything that can be done to keep the parser from logging this exception, e.g. a directive to ignore embedded fonts? Getting rid of these warnings would make it easier to focus on any "real" errors in the logs.

The issue does not appear to cause any "real" problem for the collector.

Terry

essiembre commented 3 years ago

Those warnings are indeed harmless for extracting the text but can be annoying for sure. To get rid of them, you can probably change the log level to ERROR.

Locate your log4j.properties file and add this line somewhere:

log4j.logger.org.apache.fontbox=ERROR

Or a bit broader:

log4j.logger.org.apache=ERROR

If those already exist, do not duplicate them; change the existing log level instead.
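
Note that in the log above the WARN line names PDTrueTypeFont as the logger, and the stack trace places that class in the org.apache.pdfbox.pdmodel.font package rather than under org.apache.fontbox. If the fontbox-only line does not silence the warning, a narrower setting aimed at that one class might look like the sketch below. The fully qualified logger name is an assumption inferred from the stack trace (it presumes the logger is named after the class); it has not been confirmed in this thread.

```properties
# Assumption: the logger is named after the class seen in the stack trace.
# Raises only this logger to ERROR, so other PDFBox warnings stay visible.
log4j.logger.org.apache.pdfbox.pdmodel.font.PDTrueTypeFont=ERROR
```

This sits between the two options above: the broader org.apache setting would also cover this logger, at the cost of hiding warnings from every other Apache library.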