Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Splitting PST Files (Microsoft Exchange format) with DOM splitter #101

Closed: CorinneKnoe closed this issue 4 years ago

CorinneKnoe commented 5 years ago

I am trying to split PST files (which contain entire mailboxes) into their elements: emails, attachments, contacts, calendar entries, etc.

Norconex is able to read the PST file, but it returns the entire content as a single document. I tried to use the DOM splitter to split the PST into its components, so far without success.

<importer>
  <preParseHandlers>
    <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
              selector=".pst"
              sourceCharset="UTF-8">
    </splitter>
  </preParseHandlers>

  <postParseHandlers>
    <tagger class="${tagger}.ReplaceTagger">
      <replace fromField="samplefield" regex="true">
        <fromValue>ping</fromValue><toValue>pongdidididponglalalala</toValue>
      </replace>
      <replace fromField="Subject" regex="true">
        <fromValue>Sample to crawl</fromValue><toValue>Sample crawled</toValue>
      </replace>
    </tagger>
  </postParseHandlers>
</importer>

Is Norconex able to split PST files? Many thanks, Corinne

essiembre commented 5 years ago

Hello Corinne,

Yes, the Norconex Importer can split PST files. Have a look at GenericDocumentParserFactory. You will find many options for controlling exactly which embedded documents get split and how. In its most open form (split every embedded document), it would look like this:

<importer>
  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
    <embedded>
      <splitContentTypes>.*</splitContentTypes>
    </embedded>
  </documentParserFactory>
</importer>

You probably want to configure it so it only splits certain content types.
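As a rough sketch of such a restricted configuration, the variant below limits splitting to PST containers only. It assumes that PST files are detected with the content type `application/vnd.ms-outlook-pst` (the media type Apache Tika uses for Outlook PST files) and that `splitContentTypes` is matched as a regular expression against the content type of the container document. Verify both against the content type reported for your own files before relying on it.

```xml
<importer>
  <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
    <embedded>
      <!-- Assumption: PST files are detected as application/vnd.ms-outlook-pst.
           Only containers whose content type matches this regex are split. -->
      <splitContentTypes>application/vnd\.ms-outlook-pst</splitContentTypes>
    </embedded>
  </documentParserFactory>
</importer>
```

With a pattern like this, other container formats (for example a zip file or a Word document with attachments) should still come back as a single document, while each item embedded in a PST becomes a document of its own.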

CorinneKnoe commented 5 years ago

Thank you very much for your help and the code example. Will give this a try!