Norconex / importer

Norconex Importer is a Java library and command-line application for parsing and extracting content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0
32 stars, 23 forks

OutOfMemoryError: GC overhead limit exceeded #9

Closed. OkkeKlein closed this issue 9 years ago.

OkkeKlein commented 9 years ago

What is the best approach to fix this?

Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:75)
    at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStream.java:61)
    at org.apache.fontbox.ttf.PostScriptTable.read(PostScriptTable.java:96)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:314)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)

OkkeKlein commented 9 years ago

Using -Xmx1560m, it still throws:

Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: Java heap space
    at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
    at org.apache.fontbox.ttf.MemoryTTFDataStream.<init>(MemoryTTFDataStream.java:45)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:314)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)

essiembre commented 9 years ago

Interesting... it looks like it is coming from PDFBox. Can you tell how large the PDF it is trying to parse is? I wonder if PDFBox tries to load it all in memory.

OkkeKlein commented 9 years ago

A bit hard to track down without proper logging (only output in the console), but as the exceptions are thrown almost instantly, I can determine that the content size is a few hundred KB per document.

Maybe you can catch the exception or add some logging to get better insight?

OkkeKlein commented 9 years ago

It doesn't seem to be a PDFBox-only problem:

Exception in thread "pool-1-thread-3" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
    at java.lang.StringBuilder.<init>(StringBuilder.java:97)
    at com.norconex.importer.handler.tagger.AbstractStringTagger.tagTextDocument(AbstractStringTagger.java:82)
    at com.norconex.importer.handler.tagger.AbstractCharStreamTagger.tagApplicableDocument(AbstractCharStreamTagger.java:78)
    at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:56)
    at com.norconex.importer.Importer.tagDocument(Importer.java:514)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:346)
    at com.norconex.importer.Importer.importDocument(Importer.java:317)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

(Identical "Java heap space" traces followed for threads "pool-1-thread-2" and "pool-1-thread-1".)

OkkeKlein commented 9 years ago

And another, different one:

Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
    at com.carrotsearch.hppc.IntArrayList.ensureBufferSpace(IntArrayList.java:368)
    at com.carrotsearch.hppc.IntArrayList.add(IntArrayList.java:100)
    at org.carrot2.text.suffixtree.SuffixTree.addTransition(SuffixTree.java:376)
    at org.carrot2.text.suffixtree.SuffixTree.createTransition(SuffixTree.java:349)
    at org.carrot2.text.suffixtree.SuffixTree.update(SuffixTree.java:234)
    at org.carrot2.text.suffixtree.SuffixTree.<init>(SuffixTree.java:202)
    at org.carrot2.text.suffixtree.SuffixTreeBuilder.build(SuffixTreeBuilder.java:58)
    at org.carrot2.clustering.stc.GeneralizedSuffixTree$SequenceBuilder.buildSuffixTree(GeneralizedSuffixTree.java:118)
    at org.carrot2.clustering.stc.STCClusteringAlgorithm.cluster(STCClusteringAlgorithm.java:448)
    at org.carrot2.clustering.stc.STCClusteringAlgorithm.access$000(STCClusteringAlgorithm.java:75)
    at org.carrot2.clustering.stc.STCClusteringAlgorithm$2.process(STCClusteringAlgorithm.java:398)
    at org.carrot2.text.clustering.MultilingualClustering.clusterByLanguage(MultilingualClustering.java:283)
    at org.carrot2.text.clustering.MultilingualClustering.process(MultilingualClustering.java:162)
    at org.carrot2.clustering.stc.STCClusteringAlgorithm.process(STCClusteringAlgorithm.java:391)
    at org.carrot2.core.ControllerUtils.performProcessing(ControllerUtils.java:106)
    at org.carrot2.core.Controller.process(Controller.java:357)
    at org.carrot2.core.Controller.process(Controller.java:247)
    at org.carrot2.core.Controller.process(Controller.java:224)
    at com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger.getCarrotTitle(TitleGeneratorTagger.java:289)
    at com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger.tagStringContent(TitleGeneratorTagger.java:179)
    at com.norconex.importer.handler.tagger.AbstractStringTagger.tagTextDocument(AbstractStringTagger.java:96)
    at com.norconex.importer.handler.tagger.AbstractCharStreamTagger.tagApplicableDocument(AbstractCharStreamTagger.java:78)
    at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:56)
    at com.norconex.importer.Importer.tagDocument(Importer.java:514)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:346)
    at com.norconex.importer.Importer.importDocument(Importer.java:317)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)

essiembre commented 9 years ago

If you don't mind sharing the sites you crawl, can you paste your config(s) to help reproduce? Or link to your config somehow -- I wish we could attach files to GitHub issues.

OkkeKlein commented 9 years ago

Unfortunately I can't share the URLs. But it's a couple of thousand PDFs linked from RSS feeds. It uses a custom link extractor. Is there no way to catch the exceptions and log the reference?

BTW, it might be a good idea to create a Google group or something, so you can attach files and have a place to discuss issues before taking them to GitHub.

essiembre commented 9 years ago

I like your idea of creating a Google group or equivalent. I will give it serious consideration if that GitHub limitation gets too painful.

The reason this one is not caught is probably that it is an Error (a Throwable), not an Exception. Since it can happen anywhere (as your stack traces suggest), we may not always have a reference at that point. There might be ways to improve this, though (see the sketch at the end of this comment). That said, the Importer module already offers something to do just that when parsing fails, but I do not know if it works in your case (with a Throwable). It is worth trying:

Add this tag between your <importer> tags:

<parseErrorsSaveDir>/path/where/to/save/these</parseErrorsSaveDir>

In case it helps, you can also try using <keepDownloads>true</keepDownloads> and locate the most recent files when it fails, but that may be unrealistic if you have tons of files. Plus, if it is an accumulation of things over time that makes it fail, it may not be tied to specific files at all.
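For reference, here is a rough sketch of where the two settings go, assuming the usual HTTP Collector crawler configuration layout (the save path is the same placeholder as above):

<crawler id="...">
    <keepDownloads>true</keepDownloads>
    <importer>
        <parseErrorsSaveDir>/path/where/to/save/these</parseErrorsSaveDir>
    </importer>
</crawler>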

Finally, short of giving the URLs to your RSS feeds, can you describe the average size of your PDFs, the number of threads you are using, the delay, and maybe the importer handlers you are using (taggers, transformers, etc.)? There has to be a way to reproduce this easily with a similar setup.
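On the Throwable point, the kind of guard that would be needed looks roughly like this (a hypothetical sketch, not current Importer code; the helper class, method, and logger names are made up):

import org.apache.log4j.Logger;

public final class ReferenceGuard {

    private static final Logger LOG = Logger.getLogger(ReferenceGuard.class);

    // Hypothetical sketch: catching Throwable (not just Exception) is what
    // lets an OutOfMemoryError be logged together with the reference being
    // processed, instead of silently killing the worker thread.
    public static void runGuarded(Runnable work, String reference) {
        try {
            work.run();
        } catch (Throwable t) { // includes Error subclasses such as OOM
            LOG.error("Processing failed for reference: " + reference, t);
        }
    }
}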

OkkeKlein commented 9 years ago

Using parseErrorsSaveDir, I get this console message:

[Fatal Error] :6:80: The entity "C" was referenced, but not declared.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@554b2133
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:162)
    at com.norconex.importer.Importer.parseDocument(Importer.java:415)
    at com.norconex.importer.Importer.importDocument(Importer.java:315)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@554b2133
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:258)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    ... 14 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 153
    at org.apache.fontbox.ttf.CmapSubtable.processSubtype6(CmapSubtable.java:347)
    at org.apache.fontbox.ttf.CmapSubtable.initSubtable(CmapSubtable.java:98)
    at org.apache.fontbox.ttf.CmapTable.read(CmapTable.java:79)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:299)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:159)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:135)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:96)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:130)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:93)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:50)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    ... 18 more

No OOM, but then soon after:

Exception in thread "pool-1-thread-2" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
    at java.lang.StringBuilder.<init>(StringBuilder.java:97)
    at com.norconex.importer.handler.tagger.AbstractStringTagger.tagTextDocument(AbstractStringTagger.java:83)
    at com.norconex.importer.handler.tagger.AbstractCharStreamTagger.tagApplicableDocument(AbstractCharStreamTagger.java:78)
    at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:56)
    at com.norconex.importer.Importer.tagDocument(Importer.java:515)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:347)
    at com.norconex.importer.Importer.importDocument(Importer.java:318)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

And in parseErrorsSaveDir a new file showed up:

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:162)
    at com.norconex.importer.Importer.parseDocument(Importer.java:415)
    at com.norconex.importer.Importer.importDocument(Importer.java:315)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:160)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:172)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    ... 14 more
Caused by: java.io.IOException: Image stream was not read - filter: DCTDecode
    at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:272)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:116)
    at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:65)
    at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:194)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.extractImages(EnhancedPDF2XHTML.java:348)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.endPage(EnhancedPDF2XHTML.java:247)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:349)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:283)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:248)
    at org.apache.tika.parser.pdf.EnhancedPDF2XHTML.process(EnhancedPDF2XHTML.java:144)
    ... 21 more

OkkeKlein commented 9 years ago

I also get a few parsed docs with the first line doubled, so "This article is about catsThis article is about cats". Parsing the file directly with PDFBox does not show this behavior. No clue where to start looking for the cause, but I wanted to mention it in case it gives some insight.

essiembre commented 9 years ago

Directly with PDFBox 2.0.0? We are using a snapshot release of PDFBox, as it fixes several issues we had with the 1.x version. Maybe we are finding out it is not stable enough yet... I'll try to reproduce your errors when I get a chance.

essiembre commented 9 years ago

Maybe it's also time to try Xpdf, as you previously suggested. :-)

OkkeKlein commented 9 years ago

I used pdfbox-app-1.8.9.jar for parsing. I can give you the PDF content that gave the exception. What is the best (private) way to do this?

I always liked Xpdf (pdftotext), and apparently it is more memory-efficient.

essiembre commented 9 years ago

You can use the email address you'll find under my GitHub profile. If it's too big for email, you can email me a link to it (e.g., Dropbox). If those two options do not work, I can find you a place to drop it.

essiembre commented 9 years ago

Until the memory exception can be resolved, a new Importer snapshot release has been made, allowing you to use an external parser. A new class called ExternalParser has been created for this (based on the Apache Tika parser of the same name).

To use it, create your own document parser factory that you can put under the "classes" folder of your installation (with a directory structure that matches your package declaration in Java, if any). Here is an example, called CustomDocumentParserFactory.java, that you can use:

import java.util.Map;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.ExternalParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {

        // Use an external pdftotext process for PDFs instead of the
        // embedded Tika/PDFBox parser.
        ExternalParser pdfParser = new ExternalParser();
        pdfParser.setCommand(
                // Replace this with your own executable path
                "C:\\Apps\\xpdfbin-win-3.04\\bin64\\pdftotext.exe",
                "-enc", "UTF-8", "-raw", "-q", "-eol", "unix",
                ExternalParser.INPUT_FILE_TOKEN,
                ExternalParser.OUTPUT_FILE_TOKEN);

        // Keep all default parsers and override only the PDF one.
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();
        parsers.put(ContentType.PDF, pdfParser);
        return parsers;
    }
}

Once this class is added to your classpath, make sure to reference it in your configuration like this:

<importer>
    ...
    <!-- prefix the class name with your package name, if you declared one -->
    <documentParserFactory class="CustomDocumentParserFactory" />
    ...
</importer>

Because this custom factory extends the default GenericDocumentParserFactory, you can also use the extra configuration options available for it.

At some point I plan to make (applicable) parsers configurable via XML like the rest, but for now you'll have to deal with the above code.

OkkeKlein commented 9 years ago

Added the class to the Importer and completed a full crawl without OOM.

OkkeKlein commented 9 years ago

EOL parsing seems to go wrong with pdftotext, resulting in concatenated words.

essiembre commented 9 years ago

If you check the class sample I provided, the following two arguments specify the EOL character:

"-eol", "unix"

You can try replacing "unix" with "dos" or "mac".
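For example, in the factory above (the executable path is a placeholder):

pdfParser.setCommand(
        "/usr/local/bin/pdftotext",   // replace with your own path
        "-enc", "UTF-8", "-raw", "-q", "-eol", "dos",
        ExternalParser.INPUT_FILE_TOKEN,
        ExternalParser.OUTPUT_FILE_TOKEN);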

Do you have that issue when you run pdftotext on the command prompt as well? If so, I recommend you check with the pdftotext authors.

OkkeKlein commented 9 years ago

No, the command line is working fine. And I am using Linux, so the unix EOL should work.

essiembre commented 9 years ago

Interesting. I tried with several PDFs without being able to reproduce the issue.

Can you print the full command you are using on the command line, along with your OS and ideally a copy of your config? Do you have this issue with all PDFs or just specific ones? If the latter, can you share one? If one of the ones you sent me via email earlier causes the issue, just let me know which one. I will try to reproduce.

OkkeKlein commented 9 years ago

The behavior is with all PDFs, it seems. I dropped a concat.pdf in Dropbox that shows the behavior really well. I tested with both unix and dos EOL, and it made no difference in the end result.

Edited: The same files parsed with PDFBox (with the crawler, not pdfbox-app) didn't have the issue. With the latest snapshot they are also concatenated.

Got both an OOM and a "too many open files" error with 2 threads:

[Fatal Error] :6:80: The entity "C" was referenced, but not declared.
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:64)
    at java.lang.StringBuilder.<init>(StringBuilder.java:97)
    at com.norconex.importer.handler.transformer.AbstractStringTransformer.transformTextDocument(AbstractStringTransformer.java:82)
    at com.norconex.importer.handler.transformer.AbstractCharStreamTransformer.transformApplicableDocument(AbstractCharStreamTransformer.java:66)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:59)
    at com.norconex.importer.Importer.transformDocument(Importer.java:542)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:348)
    at com.norconex.importer.Importer.importDocument(Importer.java:317)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

(An identical "Java heap space" trace followed immediately for the same thread.)

essiembre commented 9 years ago

I checked all your .txt files on Dropbox and they are all OK. They all contain line feeds, so they should be fine to look at on Linux systems. On Windows, the lines will appear concatenated in Notepad, but any decent editor should display them properly. I also tried running it myself, and the output was the same as your .txt files. Using Notepad++, I am showing invisible characters and I can clearly see there are line feeds in your file (I blurred the lines):

[image: linefeeds]

Even if you cut-and-paste a few lines from your concat.txt file here in the GitHub comment text area, you will see multiple lines (no concatenation performed).

Unless I am missing something?

OkkeKlein commented 9 years ago

Yeah, the concatenated words are in the files that are committed (to Solr). So while the parsers work fine, the end result is different.

It used to be only with pdftotext, but with the latest snapshot I see the same with PDFBox.

OkkeKlein commented 9 years ago

Will open a new issue for the concatenated terms.

Back to the main issue: I ran a crawl with -XX:+HeapDumpOnOutOfMemoryError and the dump is uploading to Dropbox. I hope this gives some insight.
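For reference, the extra JVM flags were along these lines (the dump path is a placeholder, and how flags get passed to the crawler launch script may differ per setup):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps ...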

essiembre commented 9 years ago

Got it.

essiembre commented 9 years ago

Your heap dump is quite informative. It looks like when the OOM error occurs, over 95% of the used memory is in a single char[] array (close to 600 MB). Hard to tell for sure, but it seems to be filled up in a ReplaceTransformer instance. Do you make use of ReplaceTransformer? I wonder if you can try without it to see if you still get the OOM error (knowing that's not what you want... just as a test).

That class (and a few others) tries to take as much memory as possible by checking what remains first, so it does not go overboard. But maybe that does not leave enough for other threads at some point. Out of curiosity, were you able to reproduce with a single thread?

If you confirm you are indeed using ReplaceTransformer, I can change it to be much nicer with memory and hopefully resolve this.

I can print here part of the text in that character array in case you want to identify which document it is.

OkkeKlein commented 9 years ago

Yes I am using ReplaceTransformer.

I can only crawl in certain time frames, so I'd rather test a less memory-hungry ReplaceTransformer :)

The PDF that caused the OOM was 873 KB and parses just fine using command-line PDFBox and pdftotext. You can find it in Dropbox.

essiembre commented 9 years ago

What does your replace config look like? I'm trying to reproduce.

OkkeKlein commented 9 years ago

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
    <replace>
        <fromValue><![CDATA[[^%:;!=,@\-\+\?&\\\/\(\)\p{Nl}\p{L}\p{Sc} ]]]></fromValue>
        <toValue></toValue>
    </replace>
</transformer>
essiembre commented 9 years ago

I do not see a file that's 873 KB in Dropbox at the usual location. I see only two files: concat.pdf (30 KB) and your heap dump file. Can you re-add it?

OkkeKlein commented 9 years ago

Added.

essiembre commented 9 years ago

Got it, but I still can't reproduce! The text gets extracted without an OOM error. I am at the point now where I suggest you share your whole config, and I will run the exact same thing you are. Is this a possibility?

OkkeKlein commented 9 years ago

I didn't expect the file to give an OOM. It looks more like a memory leak to me that builds up and results in the OOM.

Unfortunately the server is not public, so I will have to do the testing for you.

I will try and gather more info.

essiembre commented 9 years ago

OK, if there is a leak, I should be able to reproduce it by looping X number of times over that document. Do you have an idea after how many documents this occurs? I will try with huge amounts... so if there is a leak, it will break.

FYI, since it is believed the issue is with the importing, I am testing with the Importer only. I created a small test app that shares an Importer instance between many threads and continuously imports many docs, so I am not going through a web crawl in my attempt to reproduce. I may have to do that at some point if it turns out the leak is not with the Importer. The max I tried so far was to process 68 PDFs, totaling 243 MB. I'll try with many more when I have a chance.
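In case it helps to replicate on your side, the test app is along these lines (a hypothetical reconstruction; the Importer constructor and importDocument signature are assumptions and may differ per version):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.norconex.importer.Importer;

// Hypothetical reconstruction of the test described above: one shared
// Importer, several threads, re-importing the same PDF in a loop until
// an OutOfMemoryError (if any) shows up.
public class ImporterStressTest {
    public static void main(String[] args) throws Exception {
        final Importer importer = new Importer(); // assumed default config
        final File pdf = new File(args[0]);       // e.g., the PDF from this thread
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int t = 0; t < 10; t++) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 2000; i++) {
                        try {
                            importer.importDocument(pdf); // assumed signature
                        } catch (Throwable e) {
                            e.printStackTrace(); // surface OOM and parse errors
                        }
                    }
                }
            });
        }
        pool.shutdown();
    }
}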

OkkeKlein commented 9 years ago

I crawled for about 3 hours and 5000 PDFs before the OOM occurred.

I am running a crawl now with verbose GC.

OkkeKlein commented 9 years ago

The crawler just ended after reading all the files with links to PDFs and crawling 1 PDF. The GC log is in Dropbox.

Will resume the crawl to get the OOM, but it's weird it stopped after 1 PDF while there are 7000 in the queue.

OkkeKlein commented 9 years ago

Step by step:

  1. Started crawl (2 threads, Tika parser).
  2. Crawl ended (not aborted) after depth 1 was finished and 1 PDF was crawled (no idea why).
  3. Resumed crawl.
  4. Now the rest of the PDFs are also crawled.
  5. Noticed a dump, so stopped the crawl.

The dump is different; it's uploading to Dropbox. Will try to resume the crawl to get the OOM again.

OkkeKlein commented 9 years ago

The crawler did not finish: too many open files (ulimit is unlimited and I am only using 2 threads).

assets: 2015-05-12 19:43:54 FATAL - assets: An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues... com.norconex.jef4.JEFException: Cannot persist status update for job: assets

Looking at the GC log, I noticed PermGen is full. So I'm going to try running the crawl with a bigger PermGen.
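As an aside, on Linux the open-file limit is its own setting, separate from most other ulimit values; it is worth checking both the effective limit and the crawler process's live descriptor count (the pid is a placeholder):

ulimit -n
lsof -p <pid> | wc -l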

essiembre commented 9 years ago

I was finally able to reproduce. I ran with 10 threads and kept reprocessing your file in the same Java app. Between the 1890th and 1900th time it got processed, I had the OOM error. The odd thing is... I was tracking a JVM memory indicator the whole time. No leak could be detected, and the used and free memory were always reset every time garbage collection kicked in. I have 8 GB of RAM on my PC. 2 GB was the max memory (-Xmx) set by the JVM (the default of 1/4 of OS memory -- using JDK 7).

What I did notice is... the JVM max memory dropped just before the OOM error. You can see this in the picture below. I am not the most knowledgeable about JVM max memory settings, but I thought that once set, it can't really go lower. I tried to research online what could make it decrease mid-process, but I could not find causes. Could it simply be too many other OS processes competing for the memory?

[image: oomerror]

This image tells me it is possible the OOM error is not the result of a leak, but the result of a sudden drop in JVM max memory, going below the used memory at a given time. I am thinking this probably happens randomly at any time.
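For context, here is a tiny sketch of the three values a typical JVM memory indicator reads; the committed size (totalMemory) can legitimately shrink at runtime, which may be what the graph is showing, while maxMemory is the -Xmx ceiling:

public class MemStats {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();     // ceiling set by -Xmx
        long total = rt.totalMemory(); // committed heap; this CAN shrink
        long free = rt.freeMemory();   // unused part of the committed heap
        System.out.printf("used=%dMB committed=%dMB max=%dMB%n",
                (total - free) >> 20, total >> 20, max >> 20);
    }
}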

The safest bet might be to use an absolute amount of memory for some of the handlers and not rely on what is "believed" to be the available JVM memory (in an attempt to have some crawler tasks use as much as possible). This may "sometimes" slow down execution a bit for some of the handlers on large files (not being able to put all the file content in memory for manipulation). Overall though, the performance for the majority of crawled documents should not be impacted, and it would probably be safer (eliminating the OOM error), so that's the direction I'll go.
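To illustrate the difference (an illustration only, not the actual handler code; the names and the cap value are made up):

public final class ChunkSizing {

    // Hypothetical absolute cap per string-based handler
    // (10M chars, i.e., roughly 20 MB of char data).
    private static final int MAX_CHUNK_CHARS = 10 * 1024 * 1024;

    // Relative approach: size the work buffer from whatever the JVM reports
    // as free. Several threads doing this at once can overcommit, and the
    // reported value can change under your feet.
    static int relativeChunk(long contentLength, int activeThreads) {
        long free = Runtime.getRuntime().freeMemory();
        return (int) Math.min(contentLength, free / Math.max(1, activeThreads));
    }

    // Fixed approach: an absolute per-handler cap, independent of what other
    // threads are doing; very large documents simply get handled in several
    // passes instead of all at once.
    static int fixedChunk(long contentLength) {
        return (int) Math.min(contentLength, MAX_CHUNK_CHARS);
    }
}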

Let me know if you think my observations make no sense. :-)

essiembre commented 9 years ago

Oddly, I can reproduce very easily now with just two threads. Most of the time it happens before the document is processed even 10 times (usually after the first or second doc). That should make my testing easier.

OkkeKlein commented 9 years ago

Increasing MaxPermSize didn't solve the issue. I added some more heap dumps to give more insight.

essiembre commented 9 years ago

I played with the string-based handlers and followed the approach I suggested earlier. Limiting the memory used by these seems to solve the problem. My tests look quite good right now, and a lot of free memory always remains in the JVM. I will try to commit and make a new snapshot release today.

essiembre commented 9 years ago

OK, now the suspense starts... can you try with the latest Importer JAR found in this snapshot release I just created?

essiembre commented 9 years ago

For the curious, here is an updated memory picture, running the same test after the code change to use a fixed amount of memory:

[image: oomerror-after-rewrite]

No leak, plenty of memory available.

OkkeKlein commented 9 years ago

Great! The crawler is running with the new snapshot. Will keep you informed of the results.

OkkeKlein commented 9 years ago

It only crawled depth 1, with the files with links to the PDFs. I have to restart the crawl with resume to crawl the PDFs.

Very weird behavior, as it has many more links in the queue but never crawls them.

OkkeKlein commented 9 years ago

Wasn't this fixed?

Exception in thread "pool-1-thread-1" java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDDocumentInformation.getDictionary()Lorg/apache/pdfbox/cos/COSDictionary;
    at org.apache.tika.parser.pdf.EnhancedPDFParser.extractMetadata(EnhancedPDFParser.java:300)
    at org.apache.tika.parser.pdf.EnhancedPDFParser.parse(EnhancedPDFParser.java:162)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:374)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:159)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:314)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:267)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

essiembre commented 9 years ago

Yes, it was, and I made sure all my tests were running fine. Maybe a later PDFBox version sneaked in when I was building the snapshots (it pulls the latest one at that point). Let me check and create a new build if that's the case.

essiembre commented 9 years ago

I suspect other dependencies are out of sync. I was probably wrong to say to just copy the Importer JAR. Can you copy all the JARs from the Importer snapshot to the lib/ folder (and check for duplicates)? I suspect pdfbox, fontbox, and maybe some others were updated as well.

essiembre commented 9 years ago

Based on offline discussions with @OkkeKlein, I am marking this memory issue as fixed. The latest Importer snapshot does not generate OOM errors. There is still a "too many open files" issue with PDFBox remaining for some Linux/Unix crawls, but it is being tracked separately here for now.