Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Getting GC overhead limit exceeded exception while crawling 10 MB file #39

Closed jayjamba closed 6 years ago

jayjamba commented 6 years ago

Hi, when I try to crawl a 10 MB file, the exception below is thrown. Can you please check? I have attached the file that I was trying to crawl.

```
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3044)
    at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1440)
    at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
    at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
    at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
    at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
    at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:164)
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:152)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:169)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:243)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:105)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
    at com.norconex.importer.Importer.parseDocument(Importer.java:414)
    at com.norconex.importer.Importer.importDocument(Importer.java:313)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:224)
```

Attachment: 10MB.docx

essiembre commented 6 years ago

There is a reported issue with the Apache Tika parser used for .docx files; see TIKA-2109. Luckily, there is also a solution that worked just fine when I tried it. I am reprinting it here for convenience.

First create a new file anywhere on your system, with the following content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
        <params>
            <param name="useSAXDocxExtractor" type="bool">true</param>
            <param name="includeDeletedContent" type="bool">true</param>
            <param name="includeMoveFromContent" type="bool">true</param>
        </params>
    </parser>
  </parsers>
</properties>
```
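Not part of the original thread, but before wiring the file into the JVM it can help to confirm it is well-formed XML and that the SAX-based extractor flag is actually set. A minimal sketch using only the Python standard library, with the config above inlined for illustration (in practice you would read your tika_config.xml from disk):

```python
import xml.etree.ElementTree as ET

# The Tika config from above, inlined for illustration; in practice,
# read the contents of your tika_config.xml instead.
CONFIG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
        <params>
            <param name="useSAXDocxExtractor" type="bool">true</param>
            <param name="includeDeletedContent" type="bool">true</param>
            <param name="includeMoveFromContent" type="bool">true</param>
        </params>
    </parser>
  </parsers>
</properties>"""

# Parse the document (as bytes, so the encoding declaration is honored)
# and collect every <param> entry by name.
root = ET.fromstring(CONFIG_XML.encode("utf-8"))
params = {p.get("name"): p.text for p in root.iter("param")}

# The SAX-based .docx extractor is the setting that avoids loading the
# whole document as an in-memory DOM.
print("useSAXDocxExtractor =", params["useSAXDocxExtractor"])
```

A parse error here means the config file would also fail to load when Tika reads it at startup.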

Then, reference that file through a JVM system property called tika.config by adding the following to the java command (i.e., by modifying the launch script):

```
-Dtika.config="/path/to/created/file/tika_config.xml"
```

That should drastically reduce memory and CPU consumption.
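If you generate the launch command from a script rather than editing it by hand, the change amounts to inserting one extra argument before -jar. A minimal sketch; the jar name and collector arguments below are placeholders, not the actual launch-script contents:

```python
import shlex

# Path to the config file created above (example path).
tika_config = "/path/to/created/file/tika_config.xml"

cmd = [
    "java",
    f"-Dtika.config={tika_config}",               # the new system property
    "-jar", "norconex-collector-filesystem.jar",  # placeholder jar name
    "start", "-c", "collector-config.xml",        # placeholder arguments
]
print(shlex.join(cmd))
```

System properties must appear before -jar; anything after the jar name is passed to the application instead of the JVM.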

I believe this solution is still considered experimental by the Tika devs. Let us know how that works for you. We may consider making it the default in a future release of the Filesystem Collector.

jayjamba commented 6 years ago

Hi, it really worked! Thanks!

essiembre commented 6 years ago

Glad to hear!

Moshe-Malka commented 4 years ago

For anyone encountering the same issue as me:

1. Create a file with the above configuration.
2. Run Tika like this:

```
java -Xmx2g -jar tika-server.jar --config=/tmp/tika_config.xml -spawnChild
```

This sets the maximum Java heap size to 2 GB, points Tika at our config file, and spawns the server in a child process to handle errors.