OutOfMemoryError: GC overhead limit exceeded

OkkeKlein commented 9 years ago

What is the best approach to fix this?

Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:75)
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:61)
    at org.apache.fontbox.ttf.PostScriptTable.read(PostScriptTable.java:96)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:314)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)

essiembre commented 9 years ago

The latest snapshot version of the Importer includes an updated version of PDFBox that creates only one scratch file per PDF (instead of creating thousands when processing large PDFs). So the Importer was changed to use scratch files again, instead of holding the whole document in memory. Hopefully, this will prevent OOM errors, and it should also resolve the "too many files open" issue at the same time (tracked here: https://github.com/Norconex/collector-http/issues/99).
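To illustrate the idea of keeping parsed data in a single scratch file rather than on the heap, here is a minimal sketch using only the Java standard library. It is not PDFBox's or the Importer's actual implementation; the `ScratchBuffer` class and its method names are hypothetical and exist only for this example. All chunks for one document are appended to one temporary file and read back by offset, so heap usage stays roughly constant and only one file handle is open per document.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: one scratch file holds every chunk of one document. */
public class ScratchBuffer implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;

    public ScratchBuffer() throws IOException {
        // A single temp file per document, instead of one per chunk.
        file = File.createTempFile("scratch-", ".tmp");
        file.deleteOnExit();
        raf = new RandomAccessFile(file, "rw");
    }

    /** Appends a chunk to the end of the file and returns its start offset. */
    public long append(byte[] chunk) throws IOException {
        long offset = raf.length();
        raf.seek(offset);
        raf.write(chunk);
        return offset;
    }

    /** Reads {@code length} bytes back from disk, starting at {@code offset}. */
    public byte[] read(long offset, int length) throws IOException {
        byte[] out = new byte[length];
        raf.seek(offset);
        raf.readFully(out);
        return out;
    }

    /** Total number of bytes currently stored in the scratch file. */
    public long size() throws IOException {
        return raf.length();
    }

    @Override
    public void close() throws IOException {
        // Release the file handle and remove the scratch file.
        raf.close();
        file.delete();
    }

    public static void main(String[] args) throws IOException {
        try (ScratchBuffer buf = new ScratchBuffer()) {
            buf.append("page-1".getBytes(StandardCharsets.UTF_8));
            long second = buf.append("page-2".getBytes(StandardCharsets.UTF_8));
            byte[] back = buf.read(second, 6);
            System.out.println(new String(back, StandardCharsets.UTF_8));
        }
    }
}
```

The trade-off is the usual one: disk I/O is slower than heap access, but memory use no longer grows with document size, and closing the buffer releases the single file handle.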

essiembre commented 9 years ago

The fix is in the new Importer 2.2.0 stable release.

Norconex / importer

OutOfMemoryError: GC overhead limit exceeded #9