Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Question: External application tagger #63

Closed jmrichardson closed 6 years ago

jmrichardson commented 6 years ago

Can you please recommend how to use an external application to tag documents? I need to tag documents based on their content and metadata (document.reference specifically) with things like category, type, entities, etc. I have looked at:

ExternalTransformer: It appears that this takes only the content (not the metadata) and processes it, while allowing you to add tags to the metadata. Is there a way to also pass metadata to ExternalTransformer? Also, it appears that you can match metadata using the output of the command. However, I do not want to replace the content with the command output. That is, I want to keep the content pristine and only tag the document. It looks like it could work if I were able to pass in metadata and leave the content unchanged when it is passed back.

ScriptTagger: This looks more promising, as both the metadata and the content are passed to the JavaScript engine. However, I am not sure whether I can execute an external command to process the data.

What would you recommend to be able to process the content (without modifying it) together with the metadata, and tag the document (multiple tags) using the output of the command?

Thanks for your help

PS: The app is written in R or Python.

essiembre commented 6 years ago

The ExternalTransformer can be used as either a pre-parse handler or a post-parse handler. If you use it as the first element in a pre-parse handler, you should get the document as-is, without any modifications yet. As an example, if you are dealing with an HTML page, the <meta ...> fields will still be in the document (not yet extracted). Also, if your application writes the content to an output file (${OUTPUT}), then STDOUT can be used for outputting new metadata.
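
A minimal sketch of what such an external application could look like in Python, assuming it is invoked as `app.py ${INPUT} ${OUTPUT}` (the script name is hypothetical). It copies the content to the output file unchanged and prints a metadata line on STDOUT. The `DocNo:` line format and the `find_doc_number` rule are illustrative assumptions, not something the Importer prescribes.

```python
import re
import shutil
import sys


def find_doc_number(text):
    """Return the first number following 'Document No' in text, or None (hypothetical rule)."""
    match = re.search(r"Document No\.?\s*(\d+)", text)
    return match.group(1) if match else None


def main(input_path, output_path):
    # Copy the content unchanged so the document itself is not modified.
    shutil.copyfile(input_path, output_path)
    with open(input_path, encoding="utf-8", errors="replace") as f:
        doc_number = find_doc_number(f.read())
    if doc_number is not None:
        # STDOUT carries only metadata; a <match> pattern could pick this up.
        print(f"DocNo:{doc_number}")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because the content goes to the output file, anything printed on STDOUT is free to carry metadata only.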

Is your app quite complex? If not, you can probably replicate the logic using existing Importer handlers (avoiding calling an external app).

Are you dealing with text files or binary?

We could make this a feature request to also supply metadata to the external app, but it would only work when using ${INPUT} (passing a file path as an argument). Otherwise, the STDIN will be taken for the content already and we can't really safely merge them (e.g. binary files).

jmrichardson commented 6 years ago

Yes, the application is complex. The external app will be making DB calls, using machine learning to categorize, using natural language processing (OpenNLP) for entity recognition, etc.

The files are primarily Word, PDF, and TXT (no HTML). The text content will already need to be extracted for my application to process it (post-parse handler) prior to sending to ES.

Do you mean ExternalTransformer? I don't see ExternalTagger in the documentation. I really like the idea of having an external tagger that does not modify the document content, but rather only adds the metadata fields (tags).

Based on your suggestion above, would the ExternalTransformer behave as follows (see example below)?

<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/app.R ${INPUT}</command>
  <metadata>
    <match field="docnumber" valueGroup="1">DocNo:(\d+)</match>
  </metadata>
</transformer>

The app.R in the snippet above would receive the full 'document.reference' field or the entire metadata as a string:

/path/app.R 'metadata'
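
To illustrate what the match pattern in the configuration above would capture, here is a small Python sketch that applies the same regular expression to some sample command output. It only mirrors the pattern for illustration; the Importer applies it in Java, and its exact matching behavior may differ. The `extract_docnumber` helper is a hypothetical name.

```python
import re

# Same pattern as in the <match> element; group 1 is the captured value.
DOC_NUMBER = re.compile(r"DocNo:(\d+)")


def extract_docnumber(command_output):
    """Return the digits captured by group 1, or None when nothing matches."""
    match = DOC_NUMBER.search(command_output)
    return match.group(1) if match else None
```

With `valueGroup="1"`, only the digits (not the `DocNo:` prefix) would end up in the `docnumber` field.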

I was also wondering if it is possible that it may be able to do what I want as is:

I have not worked with the ExternalTransformer yet so I hope the above makes sense. Let me know what you think is the best way and happy to open a feature request or provide more detail.

Thanks again!

jmrichardson commented 6 years ago

I tested the ExternalTransformer and realized that the files passed for ${INPUT} and ${OUTPUT} are temporary files of the form:

C:\Users\john\AppData\Local\Temp\input895347324334751575.tmp

This means that option 2 above won't work because the associated "-add.ref" file is not available.

Your suggestion to pass the metadata as an argument via ${INPUT} would be great. Or, in my use case, it would be enough to send just the document.reference string via ${INPUT}. The former requires an I/O operation and parsing of the metadata, while the latter should be more efficient.

I will open a feature request for this and reference this.

Thank you

jmrichardson commented 6 years ago

One other thing I just noticed is that the ExternalTransformer expects the content from either ${OUTPUT} or STDOUT. In my case, I don't want to change the content but rather just create additional metadata (from the content and existing metadata, i.e. document.reference) to be stored in ES. I believe this means that I have to write the identical content to a file, which is an unnecessary step. In addition, the Importer would have to do a read operation to get the content (which hasn't changed). I have 6M+ files to index and am worried about the I/O overhead.

I am really hoping you could create an ExternalTagger (similar to ExternalTransformer) where you could provide the content (STDIN) and metadata (${INPUT}), and then just regex the STDOUT for the new metadata fields.

essiembre commented 6 years ago

I corrected my comment. I meant ExternalTransformer but having an ExternalTagger is a good idea (your new ticket).

I am marking this one as a feature request as well, to be able to get metadata into the external app somehow.

Thanks for your input.

essiembre commented 6 years ago

This feature has been implemented and is part of the latest snapshot release.

The ExternalTransformer now supports new command-line tokens that will be replaced with appropriate values (file paths, URL, etc.) when specified in your command. The new tokens are ${INPUT_META}, ${OUTPUT_META}, and ${REFERENCE}.
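
As a rough sketch of how an external Python application might use the ${REFERENCE} token, assume a command such as `/path/app.py ${REFERENCE}` (hypothetical script name), with the content arriving on STDIN and tags printed on STDOUT as `field:value` lines. The tag names and rules below are made up for illustration. How the command must be wired so that STDOUT stays free for metadata, and the format of the ${INPUT_META} and ${OUTPUT_META} files, are not shown here and should be checked in the class documentation.

```python
import os
import sys


def build_tags(reference, content):
    """Return 'field:value' lines derived from the reference and content (hypothetical rules)."""
    tags = []
    # Tag by file extension taken from the document reference.
    extension = os.path.splitext(reference)[1].lstrip(".").lower()
    if extension:
        tags.append(f"type:{extension}")
    # Toy keyword rule standing in for a real classifier.
    if "invoice" in content.lower():
        tags.append("category:finance")
    return tags


if __name__ == "__main__" and len(sys.argv) == 2:
    # Content arrives on STDIN; only tags are written to STDOUT.
    for line in build_tags(sys.argv[1], sys.stdin.read()):
        print(line)
```

Each printed line could then be picked up by a match pattern such as `type:(\w+)` or `category:(\w+)`.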

Have a look at the class documentation and let me know how it works for you.

essiembre commented 6 years ago

The ExternalTagger is now part of the official release (2.8.0).