Closed angelo337 closed 7 years ago
${INPUT} and ${OUTPUT} are optional, but when provided, they must be typed "as-is" in your "command" configuration, and the importer will replace them with file paths. The input file is created by the importer and holds the content. The output file is for you to create so the importer can pick it up (it holds your transformation).
So extract.sh $content is not valid; extract.sh ${INPUT} ${OUTPUT} is.
You can obtain the paths using positional parameters. E.g.:
#!/bin/sh
echo "Path to input file: $1"
echo "Path to output file: $2"
If you do not put the ${INPUT} and ${OUTPUT} placeholders, you will need to rely on STDIN to read the content, and STDOUT to write it out.
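As a minimal sketch of the STDIN/STDOUT mode, the snippet below creates a hypothetical extract.sh that just sorts lines, then feeds it content through a pipe the way a command without placeholders would receive it. The scratch directory, the sample content and the choice of sort are assumptions made for illustration only.

```shell
#!/bin/sh
# Scratch directory for the experiment (hypothetical location).
workdir=$(mktemp -d)

# Hypothetical extract.sh for a <command> configured WITHOUT placeholders:
# it reads the document content from STDIN and writes the result to STDOUT.
cat > "$workdir/extract.sh" <<'EOF'
#!/bin/sh
sort
EOF
chmod +x "$workdir/extract.sh"

# Simulate the caller: pipe some content in and capture what comes back.
result=$(printf 'pear\napple\n' | sh "$workdir/extract.sh")
echo "$result"
```

Trying the script this way on the command line is a quick check that it really consumes STDIN and produces its result on STDOUT before it is wired into the crawler.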
Be careful when you invoke this transformer. For example, if your shell script only deals with text fields, make sure you configure it as a post-parse handler.
Any clearer?
Pascal: I just made my SH script and configuration as told before and I am getting an error:
INFO [CrawlerEventManager] REJECTED_ERROR: http://www.elespectador.com/noticias/el-mundo/terremoto-de-magnitud-71-sacude-la-capital-de-mexico-articulo-713927 (java.lang.NullPointerException)
ERROR [AbstractCrawler] SOC_website: Could not process document: http://www.elespectador.com/noticias/el-mundo/terremoto-de-magnitud-71-sacude-la-capital-de-mexico-articulo-713927 (null)
java.lang.NullPointerException
    at com.norconex.importer.handler.transformer.impl.ExternalTransformer.newInputFile(ExternalTransformer.java:486)
    at com.norconex.importer.handler.transformer.impl.ExternalTransformer.transformApplicableDocument(ExternalTransformer.java:394)
    at com.norconex.importer.handler.transformer.AbstractDocumentTransformer.transformDocument(AbstractDocumentTransformer.java:56)
    at com.norconex.importer.Importer.transformDocument(Importer.java:544)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:347)
    at com.norconex.importer.Importer.importDocument(Importer.java:316)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
This is my configuration:
<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command> extract.sh ${INPUT} ${OUTPUT} </command>
</transformer>
</postParseHandlers>
also here is my external script:
#!/bin/sh
echo "source $1"
echo "destination $2"
Could you please point out my mistake? Thanks.
The NullPointerException is now fixed in the latest Importer snapshot.
Besides that, I see two potential issues:
1) Try putting the full path to your extract.sh file.
2) If that is all your shell script is doing, that won't work. The echo statements in my example were only there to show you that the argument values are paths. You actually need to read the input file and store your transformed version at the output file location. Here is an example that sorts the lines of the input file into the output file:
#!/bin/sh
# $1 is the input file path, $2 is the output file path.
sort "$1" > "$2"
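A slightly more defensive version of the same idea might look like the sketch below. The function name transform_file and the error messages are hypothetical, and sort again stands in for whatever transformation you actually need. It assumes the command is configured as extract.sh ${INPUT} ${OUTPUT}, so that the two paths arrive as $1 and $2.

```shell
#!/bin/sh
# Hypothetical extract.sh body for: extract.sh ${INPUT} ${OUTPUT}

# Sort the lines of the input file ($1) into the output file ($2).
transform_file() {
    input_path="$1"
    output_path="$2"
    if [ ! -r "$input_path" ]; then
        echo "Cannot read input file: $input_path" >&2
        return 1
    fi
    sort "$input_path" > "$output_path"
}

if [ "$#" -eq 2 ]; then
    # Normal case: both paths were passed on the command line.
    transform_file "$1" "$2"
elif [ "$#" -ne 0 ]; then
    echo "Usage: extract.sh INPUT_PATH OUTPUT_PATH" >&2
    exit 2
fi
# With no arguments the file only defines the function, so it can be
# sourced and exercised by hand.
```

Quoting "$1" and "$2" matters here because the generated file paths may contain spaces.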
Pascal: I just tested as you directed and it works perfectly, but the same does not happen with ExternalParser: with the same configuration it doesn't work. I have to continue testing and will let you know. Thanks a lot for your time and patience. angelo
How have you configured the ExternalParser?
Pascal: sorry for my delayed answer. Here is my configuration:
<parsers>
<parser contentType="text/html"
class="com.norconex.importer.parser.impl.ExternalParser" >
<command> ./extract.sh </command>
</parser>
</parsers>
extract.sh has the same content as the one that is running in the other part. Thanks a lot.
Can you share your entire config? Trying to reproduce. In the meantime, have you tried with the full path and using the placeholders?
Pascal: for sure I can share that. When I try to use INPUT and OUTPUT it does not work. With the attached file it is working in preparse, parser and postparse; however, when I try to include the input and output parameters, that does not work. Another problem I am having is that with this configuration I lose most of the metadata inside the document itself: it keeps all HTTP headers but not the Title or Twitter data, among others. In the end I managed to make boilerpipe work using this configuration, and it behaves as expected on preparse or parse, but not on postparse. Thanks.
is working in preparse, parser and postparse
So it is working?
For metadata, you can configure <metadata><pattern ... to extract them.
Pascal: it works with:
<command> /home/creangel/Downloads/solr/norconex-collector-http-2.8.0-SNAPSHOT/extract.sh </command>
It does not work with:
<command> /home/creangel/Downloads/solr/norconex-collector-http-2.8.0-SNAPSHOT/extract.sh ${INPUT} ${OUTPUT}</command>
It is normal to work with only one of the two. Without the placeholders, the content itself is sent/expected. With them, you only get paths to the input file and expected output file.
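To make the difference between the two calling conventions concrete, here is a sketch of a hypothetical script that accepts either one, exercised by hand both ways. The scratch directory, file names and sample content are made up for the demonstration, and sort is only a stand-in transformation.

```shell
#!/bin/sh
# Scratch directory for the experiment (hypothetical location).
workdir=$(mktemp -d)

# Hypothetical extract.sh that copes with both calling conventions.
cat > "$workdir/extract.sh" <<'EOF'
#!/bin/sh
if [ "$#" -eq 2 ]; then
    # Placeholders were used: $1 and $2 are file paths.
    sort "$1" > "$2"
else
    # No placeholders: content arrives on STDIN, result goes to STDOUT.
    sort
fi
EOF

# Convention 1: paths, as with "extract.sh ${INPUT} ${OUTPUT}".
printf 'b\na\n' > "$workdir/in.txt"
sh "$workdir/extract.sh" "$workdir/in.txt" "$workdir/out.txt"
from_files=$(cat "$workdir/out.txt")

# Convention 2: streams, as with plain "extract.sh".
from_stream=$(printf 'b\na\n' | sh "$workdir/extract.sh")

echo "files:  $from_files"
echo "stream: $from_stream"
```

A script written for only one convention will typically produce nothing useful under the other, which matches the behaviour described above.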
Thanks a lot for the explanation. Now I can work with boilerpipe and make it easier to find relevant documents. Best regards, angelo
Hi there: I am trying to use the ExternalTransformer on some documents. I created an SH file that is executable, receives a text file and transforms it into something else. On the command line it is working just fine; however, I don't know how to pass it the info from the content metadata in post-parse. The documentation says:
In my case it is:
When I try to invoke it like this
I am getting the same error:
Could you please help me with that? Best regards, Angelo