external Transformer filter

angelo337 commented 6 years ago

Hi there,

I am trying to extract content from a very unusual file type, and I have managed to convert it to HTML. However, when I try to put all the information back from the output, everything goes into a custom field, because the "content" field never gets passed at all. Is it possible to clean HTML held in a custom-defined field?

Here is my config:

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
            <command>
              "C:\norconex-collector-filesystem-2.7.1\extract.cmd" ${INPUT} ${OUTPUT}
            </command>
            <metadata>
              <match field="content_ascii" valueGroup="1">(.*)</match>
            </metadata>
          </transformer>
        </preParseHandlers>
      </importer>

Could you please help me with this issue?

Thanks a lot, Angelo

essiembre commented 6 years ago

Which version are you using? I recommend you try 2.8.0 (snapshot) because there were significant improvements made to the ExternalTransformer.

I am not sure I understand your issue. Do you want the content as a field? Because right now, by having ${OUTPUT} in your command, you are telling it to grab the content from a file, so the importer will treat it as a file, not a field. What is the output of your extract.cmd? Do you write to a file (path given as the second argument), or do you write to STDOUT (console)?

angelo337 commented 6 years ago

Pascal, thanks for your fast answer.

"Do you want the content as a field?"

I would like to have the output as the content field.

"What is the output of your extract.cmd?"

HTML content, as in the following output:

 "content_ascii_txt":["--- OVM Domain: DEF ---",
          "--- OVM Version 9.0.0 --- BP:0 TL:0 DB:0 FS:0 RE:4",
          "--- UTL_MEM Version 3.0.0 --- Dbg msg -",
          "--- ValueTypes 9.0 ---",
          "%FIO-W-SFMNEX, IRREGULAR PRIMARY INDEX 4477.5 EXPECTED 4489. LU 1 FT 1",
          "[14 NOV 2017 18:04:31] ODL-I-Input       C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp opened",
          "[14 NOV 2017 18:04:31] ODL-S-OpenFileRead Opened SU C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp FILE-ID: 305146.001 of SSet LIS to DLIS Conversion on LU 1",
          "[14 NOV 2017 18:04:31] ODL-I-FrameReadErrors 1 error(s) were encountered during reading of frame data.",
          "<html><head><meta http-equiv=\"Content-Type\"",
          "content=\"text/html; charset=iso-8859-1\">",
          "<meta name=\"GENERATOR\" content=\"Schlumberger DlisView 18C0-148\">",
          "<title>Verification Listing</title></head>",
          "<body bgcolor=\"#FFFFFF\">",
          "<table border=\"1\" cellspacing=\"0\" width=\"100%\">",

"Do you write to a file (path given as the second argument), or do you write to STDOUT (console)?"

I am writing the content to a file. However, the name of the file is not known in advance, because the program run from CMD generates the file internally. After the run, I just "type" any output file with an HTML extension to STDOUT (console).

Also, as you can see, the content is not in a single field value but in several records split on the "\n" character. Is it possible to have it all in a single record?

I hope that clarifies my situation a little more.

Thanks, Angelo

essiembre commented 6 years ago

You did not specify which version you were using, but assuming the latest snapshot, I think I know how to accomplish what you want. If you have no output file, remove ${OUTPUT} from your command. Then your metadata pattern matching should work. The pattern is applied to each line returned, and each matching line is stored as a separate entry, creating a multi-value field (array). If you just want a single-value field, you can merge all values obtained with MergeTagger or ForceSingleValueTagger. One example for your content_ascii field:

  <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
    <merge toField="content_ascii" singleValue="true" singleValueSeparator=" ">
      <fromFields>content_ascii</fromFields>
    </merge>
  </tagger>
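
Putting the advice above together, a minimal sketch of the pre-parse section might look like the following. It reuses the command path and the `<match>` pattern from the original post unchanged and only drops `${OUTPUT}`, so that the command's STDOUT is used. Placing the MergeTagger right after the transformer inside `preParseHandlers` is an assumption, not something confirmed in this thread, so adjust the placement to your own setup.

```xml
<importer>
  <preParseHandlers>
    <!-- ${OUTPUT} removed, as advised above: the output is taken from STDOUT, not from a file. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <command>
        "C:\norconex-collector-filesystem-2.7.1\extract.cmd" ${INPUT}
      </command>
      <metadata>
        <!-- Pattern copied from the original post: each matching line becomes one value. -->
        <match field="content_ascii" valueGroup="1">(.*)</match>
      </metadata>
    </transformer>
    <!-- Assumed placement: collapse the multi-value field into a single value. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
      <merge toField="content_ascii" singleValue="true" singleValueSeparator=" ">
        <fromFields>content_ascii</fromFields>
      </merge>
    </tagger>
  </preParseHandlers>
</importer>
```

With `singleValueSeparator=" "`, the lines are joined with a single space. A different separator can be chosen if the original line breaks matter.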

angelo337 commented 6 years ago

My apologies, I am working with 2.8. It is working with your changes plus some more config on my side. To keep the document parser from parsing the content, I added this config:

      <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
        application/octet-stream
      </documentParserFactory>

Thanks for your help, Angelo

Norconex / importer

external Transformer filter #68