Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.

Getting GC overhead limit exceeded exception while crawling 10 MB file #39

Closed jayjamba closed 6 years ago

jayjamba commented 6 years ago

Hi, when I try to crawl a 10 MB file, the exception below is thrown. Could you please check? I have attached the file that I was trying to crawl.

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at$CurLoadContext.attr(
    at
    at
    at
    at
    at
    at
    at
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(
    at org.apache.poi.POIXMLTypeLoader.parse(
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(
    at org.apache.poi.POIXMLDocument.load(
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(
    at
    at
    at org.apache.tika.parser.CompositeParser.parse(
    at org.apache.tika.parser.CompositeParser.parse(
    at org.apache.tika.parser.AutoDetectParser.parse(
    at org.apache.tika.parser.ParserDecorator.parse(
    at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(
    at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(
    at com.norconex.importer.Importer.parseDocument(
    at com.norconex.importer.Importer.importDocument(
    at com.norconex.importer.Importer.doImportDocument(
    at com.norconex.importer.Importer.importDocument(
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(
    at com.norconex.commons.lang.pipeline.Pipeline.execute(
    at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(

10MB.docx

essiembre commented 6 years ago

There is a reported issue with the Apache Tika parser used for .docx files. See TIKA-2109. Luckily, there is also a solution, which worked just fine when I tried it. I am reprinting it here for convenience.

First create a new file anywhere on your system, with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
        <param name="includeDeletedContent" type="bool">true</param>
        <param name="includeMoveFromContent" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>

Then, reference that file as a JVM argument called tika.config by adding the following to the java command (modifying the launch script).
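
A JVM system property is passed with the -D flag, so the argument would take roughly the form sketched here. The path is a placeholder (an assumption, not taken from this thread), and the variable names are only for illustration; the rest of the java command in the launch script stays as it is.

```shell
# Placeholder path (assumption): point this at the Tika config file created above.
TIKA_CONFIG_FILE="/path/to/tika-config.xml"

# -D sets a JVM system property; Tika looks up "tika.config" to locate its configuration.
TIKA_OPT="-Dtika.config=${TIKA_CONFIG_FILE}"

# In the launch script, add the option to the existing java command, for example:
#   java "${TIKA_OPT}" <rest of the original java command, unchanged>
echo "${TIKA_OPT}"
```

The option has to appear before the main class or -jar argument; anything placed after that is passed to the program rather than to the JVM.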


That should drastically reduce memory and CPU consumption.

I believe this solution is still considered experimental by the Tika devs. Let us know how it works for you. We may consider making it the default in a future release of the Filesystem Collector.

jayjamba commented 6 years ago

Hi, it really worked! Thanks!

essiembre commented 6 years ago

Glad to hear!

Moshe-Malka commented 4 years ago

For anyone encountering the same issue as me:

1) Create a file with the configuration above.
2) Run Tika like this:

java -Xmx2g -jar tika-server.jar --config=/tmp/tika_config.xml -spawnChild

This sets the maximum Java heap size to 2 GB, points Tika at our config file, and spawns the server in a child process to handle errors.