Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

OutOfMemoryError: GC overhead limit exceeded #9

Closed OkkeKlein closed 9 years ago

OkkeKlein commented 9 years ago

What is the best approach to fix this?

Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:75)
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:61)
    at org.apache.fontbox.ttf.PostScriptTable.read(PostScriptTable.java:96)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:314)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)

essiembre commented 9 years ago

The latest snapshot version of the Importer includes an updated version of PDFBox that creates only one scratch file per PDF (instead of thousands when processing large PDFs). The Importer was therefore changed to use scratch files again instead of keeping everything in memory. Hopefully this will prevent the OOM errors, and it should also resolve the "too many files open" issue (tracked here: https://github.com/Norconex/collector-http/issues/99).
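To illustrate the general idea of trading heap for a single scratch file, here is a minimal sketch using only the Java standard library. It is not the Importer's or PDFBox's actual code: `ScratchBuffer`, its threshold parameter, and the `scratch-` file prefix are hypothetical names made up for this example. Data stays in memory until it exceeds a limit, then everything moves to one temporary file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of the Importer or PDFBox): buffers bytes on
 * the heap up to a threshold, then spills to exactly one temporary file.
 */
final class ScratchBuffer implements AutoCloseable {
    private final int maxInMemory;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path scratchFile;      // created lazily, at most once per buffer
    private OutputStream fileOut;
    private long size;

    ScratchBuffer(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    void write(byte[] data) throws IOException {
        if (fileOut == null && size + data.length > maxInMemory) {
            // Threshold crossed: move what is buffered to a single scratch file.
            scratchFile = Files.createTempFile("scratch-", ".tmp");
            fileOut = Files.newOutputStream(scratchFile);
            memory.writeTo(fileOut);
            memory = null; // let the garbage collector reclaim the heap copy
        }
        if (fileOut != null) {
            fileOut.write(data);
        } else {
            memory.write(data);
        }
        size += data.length;
    }

    boolean isOnDisk() {
        return scratchFile != null;
    }

    long size() {
        return size;
    }

    /** Demo only: reading everything back defeats the purpose for large data. */
    byte[] toByteArray() throws IOException {
        if (fileOut == null) {
            return memory.toByteArray();
        }
        fileOut.flush();
        return Files.readAllBytes(scratchFile);
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
            Files.deleteIfExists(scratchFile);
        }
    }
}
```

With a threshold of 4 bytes, writing 3 bytes keeps the buffer in memory, and writing 3 more moves all 6 bytes to the scratch file. Using one file per document, rather than one per internal buffer, is also what keeps the number of open file handles low.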

essiembre commented 9 years ago

The fix is in the new Importer 2.2.0 stable release.