OkkeKlein closed this issue 9 years ago
The latest snapshot version of the Importer includes an updated version of PDFBox that creates only one scratch file per PDF (instead of creating thousands when processing large PDFs). The Importer was therefore changed to use scratch files again, instead of storing everything in memory. Hopefully this will prevent OOM errors, and it should also resolve the "too many files open" issue (tracked here: https://github.com/Norconex/collector-http/issues/99).
The fix is in the new Importer 2.2.0 stable release.
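To illustrate the scratch-file idea described above, here is a minimal sketch using only the Java standard library. It is not the Importer's or PDFBox's actual code; the class name `ScratchFileSketch` and the method `spoolToScratchFile` are hypothetical. It only shows why spooling to a single temporary file keeps heap usage bounded regardless of document size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the Importer or PDFBox.
final class ScratchFileSketch {

    /**
     * Copies the stream into one temporary scratch file. Only a fixed
     * 8 KB buffer lives on the heap, whatever the size of the input.
     */
    static Path spoolToScratchFile(InputStream in) throws IOException {
        Path scratch = Files.createTempFile("scratch-", ".tmp");
        try (OutputStream out = Files.newOutputStream(scratch)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            // Do not leave a half-written scratch file behind.
            Files.deleteIfExists(scratch);
            throw e;
        }
        return scratch;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large document; a real caller would pass a file or network stream.
        InputStream document = new ByteArrayInputStream(new byte[1_000_000]);
        Path scratch = spoolToScratchFile(document);
        try {
            System.out.println("Scratch file size: " + Files.size(scratch) + " bytes");
        } finally {
            // One file per document, removed when processing is done.
            Files.deleteIfExists(scratch);
        }
    }
}
```

Because each document uses exactly one scratch file that is deleted after use, this pattern also keeps the number of open file handles low, which is the property the "too many files open" issue depends on.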
What is the best approach to fix this?
Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:75)
at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:61)
at org.apache.fontbox.ttf.PostScriptTable.read(PostScriptTable.java:96)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
at com.norconex.importer.Importer.parseDocument(Importer.java:414)
at com.norconex.importer.Importer.importDocument(Importer.java:314)
at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
at com.norconex.importer.Importer.importDocument(Importer.java:195)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
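"GC overhead limit exceeded" is the HotSpot JVM's way of saying it spends almost all of its time in garbage collection while reclaiming very little heap. Before upgrading, it can help to confirm how much heap the process actually has. The sketch below is a generic diagnostic, not part of the collector; the class `HeapReport` is hypothetical. The maximum it reports is what the standard `-Xmx` JVM option controls; how that option is passed to the collector's launch script is not covered in this thread.

```java
// Hypothetical diagnostic for illustration; not part of the Norconex collector.
final class HeapReport {

    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will try to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use (allocated minus free), in megabytes. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    static String describe() {
        return "max heap: " + maxHeapMb() + " MB, used: " + usedHeapMb() + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported maximum is small relative to the PDFs being crawled, raising it is a stopgap; the scratch-file change in the Importer addresses the underlying memory use.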