Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

external Transformer filter #68

Closed angelo337 closed 6 years ago

angelo337 commented 6 years ago

Hi there,

I am trying to extract content from a very special file type, and I managed to convert it to HTML. However, when I try to bring all the information back from the output, everything goes into a custom field because the "content" field never gets passed at all. Is it possible to clean HTML held in a custom-defined field?

here is my config:

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
            <command>
              "C:\norconex-collector-filesystem-2.7.1\extract.cmd" ${INPUT} ${OUTPUT}
            </command>
            <metadata>
              <match field="content_ascii" valueGroup="1">(.*)</match>
            </metadata>
          </transformer>
        </preParseHandlers>
      </importer>

Could you please help me with this issue?

Thanks a lot, Angelo

essiembre commented 6 years ago

Which version are you using? I recommend you try 2.8.0 (snapshot) because there were significant improvements made to the ExternalTransformer.

I am not sure I understand your issue. Do you want the content as a field? Because right now, by having ${OUTPUT} in your command, you are telling it to grab the content from a file, so the importer will treat it as a file, not a field. What is the output of your extract.cmd? Do you write to a file (path given as the second argument), or do you write to STDOUT (console)?

angelo337 commented 6 years ago

Pascal, thanks for your fast answer.

"Do you want the content as a field?" I would like to have the output as a Content field.

"What is the output of your extract.cmd?" HTML content, as in the following output:

 "content_ascii_txt":["--- OVM Domain: DEF ---",
          "--- OVM Version 9.0.0 --- BP:0 TL:0 DB:0 FS:0 RE:4",
          "--- UTL_MEM Version 3.0.0 --- Dbg msg -",
          "--- ValueTypes 9.0 ---",
          "%FIO-W-SFMNEX, IRREGULAR PRIMARY INDEX 4477.5 EXPECTED 4489. LU 1 FT 1",
          "[14 NOV 2017 18:04:31] ODL-I-Input       C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp opened",
          "[14 NOV 2017 18:04:31] ODL-S-OpenFileRead Opened SU C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp FILE-ID: 305146.001 of SSet LIS to DLIS Conversion on LU 1",
          "[14 NOV 2017 18:04:31] ODL-I-FrameReadErrors 1 error(s) were encountered during reading of frame data.",
          "<html><head><meta http-equiv=\"Content-Type\"",
          "content=\"text/html; charset=iso-8859-1\">",
          "<meta name=\"GENERATOR\" content=\"Schlumberger DlisView 18C0-148\">",
          "<title>Verification Listing</title></head>",
          "<body bgcolor=\"#FFFFFF\">",
          "<table border=\"1\" cellspacing=\"0\" width=\"100%\">",

"Do you write to a file (path given as the second argument), or do you write to STDOUT (console)?" I am writing the content to a file; however, the name of the file is not known in advance because the program run from CMD generates the file internally. After it runs, I just "type" any output file with an HTML extension to STDOUT (console).

Also, as you can see, the content is not all in a single field value but split into several records delimited by the "\n" character. Is it possible to have it all in a single record?

I hope that clarifies my situation a little bit more.

Thanks, Angelo

essiembre commented 6 years ago

You did not specify which version you were using, but assuming the latest snapshot, I think I know how to accomplish what you want. If you have no output file, remove ${OUTPUT} from your command. Then your metadata pattern matching should work. The patterns are applied to each line returned, but fear not: each matching line will be stored as a separate entry, creating a multi-value field (array). If you just want a single-value field, you can merge all values obtained with MergeTagger or ForceSingleValueTagger. One example for your content_ascii field:

  <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
      <merge toField="content_ascii" singleValue="true" singleValueSeparator=" ">
        <fromFields>content_ascii</fromFields>
      </merge>
  </tagger>
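
Putting the two suggestions together, here is a minimal sketch of what the adjusted configuration might look like. The command and the match pattern are copied from the original post with ${OUTPUT} removed. Placing the MergeTagger right after the transformer inside preParseHandlers is an assumption, not something confirmed in this thread, so adjust it to wherever your other handlers live.

```xml
<importer>
  <preParseHandlers>
    <!-- ${OUTPUT} removed: the script prints to STDOUT instead of writing
         to a file path supplied by the importer. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <command>
        "C:\norconex-collector-filesystem-2.7.1\extract.cmd" ${INPUT}
      </command>
      <metadata>
        <!-- Applied to each line returned; every match becomes one
             value of the multi-value content_ascii field. -->
        <match field="content_ascii" valueGroup="1">(.*)</match>
      </metadata>
    </transformer>
    <!-- Assumed placement: collapse the multi-value field into a single
         space-separated value once the transformer has run. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
      <merge toField="content_ascii" singleValue="true" singleValueSeparator=" ">
        <fromFields>content_ascii</fromFields>
      </merge>
    </tagger>
  </preParseHandlers>
</importer>
```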

angelo337 commented 6 years ago

My apologies, I am working with 2.8. It is working with your changes plus some more config on my side. To keep the document parser from parsing the content, I included this config:

    <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
      application/octet-stream
    </documentParserFactory>
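
The element that originally wrapped application/octet-stream appears to have been lost from the snippet above. As a hedged reconstruction only, the sketch below assumes the missing element is ignoredContentTypes, which would fit the stated goal of skipping the parser for that content type. The element name is an assumption and is not confirmed anywhere in this thread, so check it against the GenericDocumentParserFactory documentation for your version.

```xml
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <!-- Assumed element name: content types matching this pattern would be
       left unparsed, so the external transformer output is kept as-is. -->
  <ignoredContentTypes>application/octet-stream</ignoredContentTypes>
</documentParserFactory>
```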

Thanks for your help, Angelo