Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Calling an external app while crawling #317

Closed angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there, I am wondering whether it is possible to call an external app (a .jar file) from the collector while it is crawling. I am trying to use a machine learning model (SVM) to classify my docs. Thanks, best regards, Angelo

essiembre commented 7 years ago

There are a few approaches that may help. You can override the default parser for your content type by adding this inside your <importer> section:

  <documentParserFactory>
      <parsers>
        <parser contentType="application/xml" 
          class="com.norconex.importer.parser.impl.ExternalParser" >
            <command>java -jar /path/to/app.jar ${INPUT} ${OUTPUT}</command>
        </parser>
      </parsers>
  </documentParserFactory>
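
For reference, here is a rough sketch of what the external app side could look like. It assumes the `${INPUT}` and `${OUTPUT}` tokens end up as file paths passed on the command line, which is worth checking against the ExternalParser documentation for your version. The class name `ExternalAppSkeleton` and the `process` method are hypothetical placeholders, not part of Norconex.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical external app: reads the input file, writes the result to the output file.
public class ExternalAppSkeleton {

    // Placeholder for the real work (parsing, classification, etc.).
    // Here it only trims surrounding whitespace.
    public static String process(String text) {
        return text.trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java -jar app.jar <input-file> <output-file>");
            System.exit(1);
        }
        Path in = Paths.get(args[0]);   // assumed to be the value of ${INPUT}
        Path out = Paths.get(args[1]);  // assumed to be the value of ${OUTPUT}

        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, process(text).getBytes(StandardCharsets.UTF_8));
    }
}
```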

But if you want to use machine learning after parsing has occurred (as opposed to overriding the default parser), this may not work.

If you know your Java, I would then recommend you look into creating your own IDocumentTagger that passes the document content to your external app and stores the result as metadata. You would add it to your <importer> section like this:

    <postParseHandlers>
        <tagger class="com.blah.MyTagger" />
    </postParseHandlers>
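
The part of such a tagger that talks to the external app could look roughly like the sketch below. It is plain Java with no Norconex dependency: the helper writes the document text to a temporary file, runs the command with that file path appended as the last argument, and returns whatever the app printed. The class name `ExternalAppRunner` and the "file path as last argument, label on standard output" contract are assumptions for illustration; your app may expect something different.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper a custom tagger could call; not part of Norconex.
public class ExternalAppRunner {

    /**
     * Writes the content to a temp file, runs the command with the temp file
     * path appended as the last argument, and returns the trimmed output.
     */
    public static String run(List<String> command, String content)
            throws IOException, InterruptedException {
        Path input = Files.createTempFile("doc-", ".txt");
        try {
            Files.write(input, content.getBytes(StandardCharsets.UTF_8));

            List<String> fullCommand = new ArrayList<>(command);
            fullCommand.add(input.toString());

            Process process = new ProcessBuilder(fullCommand)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();

            // Read all output before waiting, so the child never blocks on a full pipe.
            String output = new String(
                    process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);

            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException(
                        "External app exited with code " + exitCode + ": " + output);
            }
            return output.trim();
        } finally {
            Files.deleteIfExists(input);
        }
    }
}
```

From inside your tagger you would call something like `ExternalAppRunner.run(List.of("java", "-jar", "/path/to/app.jar"), text)` and add the returned value to the document metadata under a field name of your choosing. The tagger method signature itself depends on your Importer version, so it is left out here.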

If your external app actually changes the content, look at implementing an IDocumentTransformer instead.

I like the idea of having a Tagger and a Transformer offered out of the box that would invoke an external app. I'll mark this one as a feature request.

In the meantime, let me know if you can make it work with the above suggestions.

essiembre commented 7 years ago

The latest snapshot release now offers an ExternalTransformer. You can use it to invoke an external application that will manipulate your document and/or generate extra metadata. Usage sample:

  <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
      <metadata>
          <match field="docnumber">DocNo:(\d+)</match>
      </metadata>
  </transformer>
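
To give an idea of what the `match` pattern in the sample captures, here is a small standalone sketch. It assumes the pattern is applied to the text the external app prints and that the first capture group becomes the value of the `docnumber` field; check the ExternalTransformer documentation for the exact behaviour. The class name and sample output are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration of the DocNo:(\d+) pattern; not Norconex code.
public class DocNoMatchDemo {

    // Same regular expression as in the <match field="docnumber"> element.
    static final Pattern DOC_NO = Pattern.compile("DocNo:(\\d+)");

    // Returns the first capture group of every match found in the app output.
    public static List<String> extractDocNumbers(String appOutput) {
        List<String> values = new ArrayList<>();
        Matcher m = DOC_NO.matcher(appOutput);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical output from the external app.
        String output = "Processing...\nDocNo:12345\nDone.";
        System.out.println(extractDocNumbers(output)); // prints [12345]
    }
}
```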

Use the link above for complete documentation.

Please confirm whether that works for you.