Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

OutOfMemory when parsing a document with multiple embedded objects #19

Closed jetnet closed 8 years ago

jetnet commented 8 years ago

The issue can be reproduced with the Cisco Icon Library (zip): high CPU usage and OOM. It should be reproducible with the default example configuration using the file system connector (OCR is off). Could you please look into that? Thanks a lot!

essiembre commented 8 years ago

I can't reproduce it when parsing the file with the importer directly. I suspect it happens when combined with other activities. Have you tried increasing the JVM memory? I am not sure how you launch it, but wherever Java is invoked, you can add -Xmx4096m, replacing the number with the amount of memory (in megabytes) you want to use.
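To confirm that an -Xmx setting actually reaches the JVM, a small check like the one below can help. It is only a sketch: the class name HeapCheck is made up for illustration and is not part of the Importer. It relies solely on the standard Runtime.maxMemory() call.

```java
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Launch with the same memory flag you pass to the collector,
        // e.g. "java -Xmx4096m HeapCheck", and compare the printed value.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

The reported value may be somewhat lower than the -Xmx figure depending on the garbage collector in use. If it is far lower, the flag is probably not being passed to the process that does the parsing.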

Given that most files in the zip are images, what gets parsed may not be of much value to you, depending on what you are looking for. See the extracted content and metadata (metadata is escaped using the Java Properties format): output.zip

If you are importing many files and you find this one does not have much value, you can skip it using an import filter.

A different user reported that attempting to extract metadata from embedded images was also causing OOM exceptions for him (with a large PowerPoint file in that case). Unfortunately, there is no easy way to disable this right now, but I can make it a feature request to allow disabling metadata extraction from embedded resources, which would probably help (disabling it via custom coding fixed it in his case).

Let me know if simply increasing the memory used to parse this file solves the problem.

jetnet commented 8 years ago

Ah, now I see what is going on: the parser tries to collect the metadata from all embedded objects. I was wondering why the "normal" Tika parser does not have a similar issue when parsing the same PPTX file. I even tried with 8 GB and got the same OOM issue:

Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:261)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
        at java.util.ArrayList.addAll(ArrayList.java:579)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1270)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:70)
        at com.norconex.commons.lang.map.ObservableMap.get(ObservableMap.java:94)
        at com.norconex.commons.lang.map.Properties.get(Properties.java:1270)
        at com.norconex.commons.lang.map.Properties.getStrings(Properties.java:561)
        at com.norconex.importer.parser.impl.AbstractTikaParser.addTikaMetadata(AbstractTikaParser.java:210)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:433)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:291)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:200)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:113)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
        at com.norconex.importer.parser.impl.AbstractTikaParser$MergeEmbeddedParser.parse(AbstractTikaParser.java:432)
        at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:166)
        at com.norconex.importer.Importer.parseDocument(Importer.java:422)
        at com.norconex.importer.Importer.importDocument(Importer.java:318)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
        at com.norconex.importer.Importer.importDocument(Importer.java:195)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)

Since the whole framework is highly configurable (which is great), it makes sense to have control over extracting embedded objects as well. Feature request? :) Many thanks!

jetnet commented 8 years ago

BTW: could you please share the command to call the importer directly? Thanks!

essiembre commented 8 years ago

Made it a feature request. To use the importer directly from the command line, I suggest you download it from here if you have not already done so; it comes with launch scripts. You can find usage details here.

essiembre commented 8 years ago

As a reference, check https://github.com/Norconex/collector-http/issues/251#issuecomment-223592137 for a possible interim solution.

essiembre commented 8 years ago

The Importer module was updated (snapshot release) and now offers more control over what embedded documents you want to have extracted or not. Check out GenericDocumentParserFactory.

For instance, the parsing of some images is often what causes the OutOfMemoryError. To tell it not to extract any embedded images, you can do the following:

  <documentParserFactory>
      <embedded>
          <noExtractEmbeddedContentTypes>
            image.*
          </noExtractEmbeddedContentTypes>
      </embedded>
  </documentParserFactory>

Alternatively, in your case, you can also tell it to skip extracting any embedded documents from PowerPoint files. It would be done like this:

  <documentParserFactory>
      <embedded>
          <noExtractContainerContentTypes>
            application/vnd.openxmlformats-officedocument.presentationml.presentation
          </noExtractContainerContentTypes>
      </embedded>
  </documentParserFactory>

Embedded documents that are not extracted won't be parsed and won't trigger the OutOfMemoryError.

Please confirm when you get a chance.

V3RITAS commented 8 years ago

I've tested this new feature with the PowerPoint file that causes the OOM exception (see this ticket) and it works fine. Great!

Just one question: Is it possible to define several noExtractEmbeddedContentTypes or noExtractContainerContentTypes entries, or can you only enter a single regex here?

essiembre commented 8 years ago

No, you cannot define multiple entries right now, but a regex lets you offer multiple choices by separating them with a vertical bar inside parentheses, like this:

  <noExtractContainerContentTypes>
     (aContentType|anotherContentType|yetAnotherOne)
  </noExtractContainerContentTypes>
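
To try out such an expression before putting it in the configuration, a quick standalone check with java.util.regex can be useful. This is a sketch only: the class and method names are made up, and it assumes the surrounding whitespace is trimmed and the regex must match the whole content type, which may differ from what the Importer does internally.

```java
import java.util.regex.Pattern;

public class ContentTypeRegexDemo {

    /**
     * Returns true if the whole content type matches the regex.
     * Assumption: the expression is trimmed and applied as a full match.
     */
    static boolean isExcluded(String regex, String contentType) {
        return Pattern.compile(regex.trim()).matcher(contentType).matches();
    }

    public static void main(String[] args) {
        // Hypothetical alternation covering two kinds of content types.
        String regex = "(image/.*|application/pdf)";
        for (String type : new String[] {
                "image/png", "application/pdf", "text/html"}) {
            System.out.println(type + " excluded? " + isExcluded(regex, type));
        }
    }
}
```

Under that assumption, image/png and application/pdf match the alternation while text/html does not.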

I am closing this since it resolves the issue, but I have added a TODO item in the project to investigate how to reduce the amount of metadata extracted for images. In the cases where it is problematic, the massive amount of metadata extracted is of no value to 99.99% of use cases.