Closed angelo337 closed 6 years ago
Which version are you using? I recommend you try 2.8.0 (snapshot) because there were significant improvements made to the ExternalTransformer.
I am not sure I understand your issue. Do you want the content as a field? Right now, by having `${OUTPUT}` in your command, you are telling it to grab the content from a file, so the importer will treat it as a file, not a field. What is the output of your `extract.cmd`? Do you write to a file (path given as the second argument), or do you write to STDOUT (console)?
Pascal, thanks for your fast answer.
Do you want the content as a field? Yes, I would like to have the output as a content field.
What is the output of your extract.cmd? HTML content, like in the following output:
"content_ascii_txt":["--- OVM Domain: DEF ---",
"--- OVM Version 9.0.0 --- BP:0 TL:0 DB:0 FS:0 RE:4",
"--- UTL_MEM Version 3.0.0 --- Dbg msg -",
"--- ValueTypes 9.0 ---",
"%FIO-W-SFMNEX, IRREGULAR PRIMARY INDEX 4477.5 EXPECTED 4489. LU 1 FT 1",
"[14 NOV 2017 18:04:31] ODL-I-Input C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp opened",
"[14 NOV 2017 18:04:31] ODL-S-OpenFileRead Opened SU C:\\DOCUME~1\\ADMINI~1\\LOCALS~1\\Temp\\input7951598160920256695.tmp FILE-ID: 305146.001 of SSet LIS to DLIS Conversion on LU 1",
"[14 NOV 2017 18:04:31] ODL-I-FrameReadErrors 1 error(s) were encountered during reading of frame data.",
"<html><head><meta http-equiv=\"Content-Type\"",
"content=\"text/html; charset=iso-8859-1\">",
"<meta name=\"GENERATOR\" content=\"Schlumberger DlisView 18C0-148\">",
"<title>Verification Listing</title></head>",
"<body bgcolor=\"#FFFFFF\">",
"<table border=\"1\" cellspacing=\"0\" width=\"100%\">",
Do you write to a file (path given as the second argument), or do you write to STDOUT (console)? I am writing the content to a file, but the name of the file is not known in advance because the program run from CMD generates the file internally. After the run, I just "type" any output file with an HTML extension to STDOUT (console).
Also, as you can see, the content is not in a single field value but in several records split on the "\n" character. Is it possible to have it all in a single record?
I hope that clarifies my situation a little more.
Thanks, angelo
You did not specify which version you were using, but assuming the latest snapshot, I think I know how to accomplish what you want. If you have no output file, remove `${OUTPUT}` from your command. Then your metadata pattern matching should work. The patterns are applied to each line returned, and each matching line is stored as a separate entry, creating a multi-value field (array). If you just want a single-value field, you can merge all the values obtained with `MergeTagger` or `ForceSingleValueTagger`. One example for your `content_ascii` field:
<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="content_ascii" singleValue="true" singleValueSeparator=" ">
    <fromFields>content_ascii</fromFields>
  </merge>
</tagger>
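For comparison, the `ForceSingleValueTagger` route mentioned above might look roughly like the sketch below. It assumes the `singleValue` element and the `mergeWith:<separator>` action as described in the Importer 2.x documentation for that tagger, so verify both against the version you are running.

```xml
<!-- Sketch only: collapses the multi-value content_ascii field into one value,
     joining the entries with a single space.
     Assumption: element and attribute names follow the Importer 2.x
     ForceSingleValueTagger documentation. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
  <singleValue field="content_ascii" action="mergeWith: "/>
</tagger>
```

Whether a whitespace-only separator is preserved may depend on the version, so check the merged value after a test run. If it is not, the `MergeTagger` form with `singleValueSeparator` is the safer choice.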
My apologies, I am working with 2.8. It is working with your changes plus some more config on my side. To avoid parsing by the document parser, I included this config:
` <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
</documentParserFactory>`
Thanks for your help, angelo
Hi there,
I am trying to extract content from a very special file type, and I managed to convert it to HTML. However, when I try to put all the information back from the output, everything goes into a custom field because the "content" field is never passed along. Is it possible to clean HTML held in a custom-defined field?
Here is my config:
Could you please give me some help with this issue?
Thanks a lot, angelo