Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

ExternalTransformer usage #61

Closed: angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there: I am trying to use the ExternalTransformer on some documents. I created an executable SH file that receives a text file and transforms it into something else. On the command line it works just fine; however, I don't know how to pass information from the content or metadata in post-parse to the script, as the documentation says:

c:\Apps\myapp.exe ${INPUT} ${OUTPUT}

In my case it is:

extract.sh content

where content could be a URL or a metadata parameter to process.

When I try to invoke it like this:

extract.sh $content or extract.sh ${content}

I get the same error:

ERROR [SystemCommand] Command returned with exit value 2 (command properly escaped?). Command: extract.sh $content Error: ""

Could you please help me with that? Best regards, Angelo

essiembre commented 7 years ago

${INPUT} and ${OUTPUT} are optional, but when provided, they must be typed "as-is" in your "command" configuration and the Importer will replace them with file paths. The input file is created by the Importer and holds the content. The output file is for you to create so the Importer can pick it up (it holds your transformation).

So extract.sh $content is not valid. extract.sh ${INPUT} ${OUTPUT} is.

You can obtain the paths using positional parameters. E.g.:

#!/bin/sh
echo "Path to input file: $1"
echo "Path to output file: $2"

If you do not put the ${INPUT} and ${OUTPUT} placeholders, you will need to rely on STDIN to read the file, and STDOUT to write it out.
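
To make the STDIN/STDOUT mode concrete, here is a minimal sketch. The transform function and its upper-casing step are illustrative assumptions only (they are not part of the Importer); the printf pipe just simulates content being sent to the script.

```shell
#!/bin/sh
# Hypothetical stand-in for extract.sh when the <command> has no
# ${INPUT}/${OUTPUT} placeholders: content arrives on STDIN and the
# transformed content is written to STDOUT.
transform() {
    # Illustrative transformation only: upper-case everything.
    tr '[:lower:]' '[:upper:]'
}

# Simulate the caller: pipe some content in and capture what comes out.
result=$(printf 'hello\nworld\n' | transform)
echo "$result"
```

In a real script you would replace the body of transform with your own processing and let it read STDIN directly.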

Be careful when you invoke this transformer. For example, if your shell script only deals with text fields, make sure you configure it as a post-parse handler.

Any clearer?

angelo337 commented 7 years ago

Pascal: I just set up my SH script and configuration as you described, and I am getting an error:

INFO [CrawlerEventManager] REJECTED_ERROR: http://www.elespectador.com/noticias/el-mundo/terremoto-de-magnitud-71-sacude-la-capital-de-mexico-articulo-713927 (java.lang.NullPointerException)
ERROR [AbstractCrawler] SOC_website: Could not process document: http://www.elespectador.com/noticias/el-mundo/terremoto-de-magnitud-71-sacude-la-capital-de-mexico-articulo-713927 (null)
java.lang.NullPointerException
    at com.norconex.importer.handler.transformer.impl.ExternalTransformer.newInputFile(ExternalTransformer.java:486)
    at com.norconex.importer.handler.transformer.impl.ExternalTransformer.transformApplicableDocument(ExternalTransformer.java:394)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:56)
    at com.norconex.importer.Importer.transformDocument(Importer.java:544)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:347)
    at com.norconex.importer.Importer.importDocument(Importer.java:316)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

This is my configuration:

<postParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
    <command> extract.sh ${INPUT} ${OUTPUT} </command>
  </transformer>
</postParseHandlers>

Also, here is my external script:

#!/bin/sh
echo "source $1"
echo "destination $2"

Could you please point out my mistake? Thanks

essiembre commented 7 years ago

The NullPointerException is now fixed in the latest Importer snapshot.

Besides that, I see two potential issues:

1) Try putting the full path to your extract.sh file.

2) If that is all your shell script is doing, that won't work. The echo statements in my example were just for you to see that the argument values are file paths. You actually need to read the input file and store your transformed version at the output file location. Here is an example that takes the input file and writes its lines, sorted, to the output file:

#!/bin/sh
sort "$1" > "$2"
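
Before wiring a script like this into the crawler, it can help to run it by hand the way the Importer would, with one path to read and one path to write. The sketch below is only an assumption of how such a check might look: extract is a hypothetical function standing in for extract.sh, and the temporary files play the role of the ${INPUT} and ${OUTPUT} paths.

```shell
#!/bin/sh
# Hypothetical stand-in for extract.sh in placeholder mode:
# $1 is the input path, $2 is the output path the script must create.
extract() {
    [ "$#" -eq 2 ] || { echo "usage: extract INPUT OUTPUT" >&2; return 2; }
    [ -r "$1" ]    || { echo "cannot read input: $1" >&2; return 1; }
    sort "$1" > "$2"
}

# Simulate the call with temporary files standing in for the two paths.
in_file=$(mktemp)
out_file=$(mktemp)
printf 'pear\napple\nfig\n' > "$in_file"

extract "$in_file" "$out_file"
status=$?
sorted=$(cat "$out_file")

echo "exit status: $status"
echo "$sorted"

rm -f "$in_file" "$out_file"
```

A non-zero exit status or an empty output file here would point at the script itself rather than at the Importer configuration.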

angelo337 commented 7 years ago

Pascal: I just tested as you directed and it works perfectly. The same is not true of ExternalParser, though: with the same configuration it does not work. I will continue testing and let you know. Thanks a lot for your time and patience. Angelo

essiembre commented 7 years ago

How have you configured the ExternalParser?

angelo337 commented 7 years ago

Pascal: sorry for my delayed answer. Here is my configuration:

<parsers>
  <parser contentType="text/html"
          class="com.norconex.importer.parser.impl.ExternalParser">
    <command> ./extract.sh </command>
  </parser>
</parsers>

extract.sh is the same script as the one that runs with the transformer. Thanks a lot

essiembre commented 7 years ago

Can you share your entire config? Trying to reproduce. In the meantime, have you tried with the full path and using the placeholders?

angelo337 commented 7 years ago

Pascal: sure, I can share that. When I try to use INPUT and OUTPUT it does not work. With the attached file it is working in preparse, parser and postparse; however, when I include the input and output parameters it does not work. Another problem I am having is that with this configuration I lose most of the metadata inside the document itself: it keeps all HTTP headers but not the title or Twitter data, among others. In the end I managed to make boilerpipe work using this configuration, and it behaves as expected in pre-parse or parse, but not in post-parse. Thanks

test-config_sh.xml.zip

essiembre commented 7 years ago

"is working in preparse, parser and postparse"

So it is working?

For metadata, you can configure <metadata><pattern ... to extract them.

angelo337 commented 7 years ago

Pascal: it works with: <command> /home/creangel/Downloads/solr/norconex-collector-http-2.8.0-SNAPSHOT/extract.sh </command>

It does not work with: <command> /home/creangel/Downloads/solr/norconex-collector-http-2.8.0-SNAPSHOT/extract.sh ${INPUT} ${OUTPUT}</command>

essiembre commented 7 years ago

It is normal for it to work with only one of the two. Without the placeholders, the content itself is sent to your script on STDIN and the transformed content is expected on STDOUT. With them, your script only receives the paths to the input file and the expected output file.
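
If you want one script that copes with both styles of command, a rough sketch could branch on the number of arguments. Everything here is an assumption for illustration: run_transform is a hypothetical name, and sort merely stands in for your real processing.

```shell
#!/bin/sh
# Hypothetical dual-mode wrapper: with two arguments it treats them as
# input/output file paths; with none it filters STDIN to STDOUT.
run_transform() {
    if [ "$#" -eq 2 ]; then
        sort "$1" > "$2"
    elif [ "$#" -eq 0 ]; then
        sort
    else
        echo "expected 0 or 2 arguments, got $#" >&2
        return 2
    fi
}

# Stream mode: content in on STDIN, result out on STDOUT.
stream_result=$(printf 'b\na\n' | run_transform)

# File mode: temporary files stand in for the two paths.
in_file=$(mktemp)
out_file=$(mktemp)
printf 'd\nc\n' > "$in_file"
run_transform "$in_file" "$out_file"
file_result=$(cat "$out_file")
rm -f "$in_file" "$out_file"

echo "$stream_result"
echo "$file_result"
```

Whichever mode you choose, the command in the configuration and the script have to agree on it.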

angelo337 commented 7 years ago

Thanks a lot for the explanation. Now I can work with boilerpipe and make it easier to find relevant documents. Best regards, Angelo