Closed jayjamba closed 6 years ago
There is a reported issue with the Apache Tika parser used for .docx files. See TIKA-2109. Luckily there is also a solution that worked just fine when I tried it. I am reprinting it here for convenience.
First create a new file anywhere on your system, with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="useSAXDocxExtractor" type="bool">true</param>
        <param name="includeDeletedContent" type="bool">true</param>
        <param name="includeMoveFromContent" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
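Before wiring the file into the launch script, it can help to confirm that it is well-formed and really enables the SAX extractor. Below is a minimal Python sketch for that check. It is not part of Tika or the collector: the function name sax_docx_enabled is made up for illustration, and it only inspects the XML structure shown above.

```python
import xml.etree.ElementTree as ET

# Parser class name taken from the config above.
OOXML_PARSER = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"


def sax_docx_enabled(config_xml: str) -> bool:
    """Return True if the config sets useSAXDocxExtractor to true on OOXMLParser.

    Hypothetical helper for illustration only. Raises
    xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(config_xml)
    for parser in root.iter("parser"):
        if parser.get("class") != OOXML_PARSER:
            continue
        for param in parser.iter("param"):
            if param.get("name") == "useSAXDocxExtractor":
                # Compare the element text case-insensitively.
                return (param.text or "").strip().lower() == "true"
    return False
```

Reading the created file and passing its text to sax_docx_enabled should return True for the configuration above. A ParseError points to a copy/paste problem in the XML.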
Then, reference that file through a JVM system property called tika.config by adding the following to the java command (this means modifying the launch script):
-Dtika.config="/path/to/created/file/tika_config.xml"
That should drastically reduce memory and CPU consumption.
I believe this solution is still considered experimental by the Tika devs. Let us know how it works for you. We may consider making it the default in a future release of the Filesystem Collector.
Hi, it really worked! Thanks!
Glad to hear!
For anyone encountering the same issue as me:
1) Create a file with the above configuration.
2) Run Tika like this: java -Xmx2g -jar tika-server.jar --config=/tmp/tika_config.xml -spawnChild
This sets the maximum Java heap size to 2 GB, points Tika at our config file, and spawns the server in a child process to handle errors.
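Once the server is running, a document can be sent to it over HTTP to check that parsing now completes. The sketch below is a minimal Python client. It assumes the server listens on its default address, http://localhost:9998, and that the /tika endpoint returns plain text as UTF-8 when asked for text/plain. The function names build_tika_request and extract_text are hypothetical helpers, not part of any Tika client library.

```python
import urllib.request

# Assumed default address of a locally started tika-server.
DEFAULT_TIKA_URL = "http://localhost:9998"


def build_tika_request(
    document: bytes, base_url: str = DEFAULT_TIKA_URL
) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking tika-server for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(
    path: str, base_url: str = DEFAULT_TIKA_URL, timeout: float = 120.0
) -> str:
    """Send the file at `path` to tika-server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), base_url)
    # urlopen raises urllib.error.HTTPError when the server reports a failure.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling extract_text with the path to a large .docx file is a quick way to see whether the server gets through it within the 2 GB heap.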
Hi, when I try to crawl a 10 MB file, the exception below is thrown. Can you please check? I have attached the file that I was trying to crawl.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3044)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1440)
at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:164)
at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:152)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:169)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:243)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:105)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:416)
at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:150)
at com.norconex.importer.Importer.parseDocument(Importer.java:414)
at com.norconex.importer.Importer.importDocument(Importer.java:313)
at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
at com.norconex.importer.Importer.importDocument(Importer.java:190)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.fs.crawler.FilesystemCrawler.executeImporterPipeline(FilesystemCrawler.java:224)
10MB.docx
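A note on reading a trace like this one: the frames pass through XWPFWordExtractor and XmlBeans, which suggests the whole document was being loaded into memory by the DOM-based extractor rather than the SAX one. If so, the useSAXDocxExtractor setting may not have been picked up. The Python sketch below automates that check for saved logs. It is an assumption-laden heuristic: the function name looks_like_dom_docx_parse is made up, and it only searches for the class names visible in the trace above.

```python
# Class and package names copied from the stack trace above.
DOM_EXTRACTOR_FRAME = "org.apache.poi.xwpf.extractor.XWPFWordExtractor"
XMLBEANS_PREFIX = "org.apache.xmlbeans."


def looks_like_dom_docx_parse(stack_trace: str) -> bool:
    """Heuristic: True if the trace went through the DOM-based .docx extractor.

    Hypothetical helper; it only checks for the two names above and makes no
    claim about which Tika configuration was actually loaded.
    """
    return DOM_EXTRACTOR_FRAME in stack_trace and XMLBEANS_PREFIX in stack_trace
```

If this returns True for a trace captured after the config change, it is worth double-checking that the tika.config property actually reaches the JVM that runs the crawler.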