Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to modify document content with Javascript? #665

Closed LeMoussel closed 4 years ago

LeMoussel commented 4 years ago

I want to modify the document content with JavaScript code. To do this, I use ScriptTagger as follows:

<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
   <script><![CDATA[
      // Document content
      var Contenttype = metadata.getString('document.contentType');
      if (Contenttype != null && Contenttype == 'text/html') {
         if (content != null) {
            // Test
            content = '=> TEST: ' + content;
         }
      }
   ]]></script>
</tagger>

In the output file (I use JSONFileCommitter) the content value has not changed. '=> TEST: ' is not present in content.

{"doc-add": {
     "reference": "document reference, e.g., URL",
     "metadata": { ... },
     "content": "Initial Document Content without '=> TEST: '"
}}

essiembre commented 4 years ago

I tried your script with sample data and it works fine for me. I suspect the content type is not what your code expects. You can find out by appending it to your content, just for testing:

var Contenttype = metadata.getString('document.contentType');
content += ' Contenttype: ' + Contenttype;
...

LeMoussel commented 4 years ago

Strange... it's not working for me. Here's my test config: testConfigHttpCollector.xml.txt. I'm running it under Windows with testCollector-http.bat: testCollector-http.bat.txt

In the JSON file under .\output-test\crawledFilesJSON, the content field contains the following:

[
  {
  "doc-add": {
    "reference": "http://httpbin.org/forms/post",
    "metadata": {
      "collector.referrer-link-text": [
        "HTML form"
      ],
      "collector.referrer-reference": [
        "http://httpbin.org/"
      ],
      "collector.depth": [
        "1"
      ]
    },
    "content": "Customer name: Telephone: E-mail address: Pizza Size Small Medium Large Pizza Toppings Bacon Extra Cheese Onion Mushroom Preferred delivery time: Delivery instructions: Submit order"
  }
},
  {
  "doc-add": {
    "reference": "http://httpbin.org/",
    "metadata": {
      "title": [
        "httpbin.org"
      ],
      "collector.depth": [
        "0"
      ],
      "collector.referenced-urls": [
        "http://httpbin.org/forms/post"
      ]
    },
    "content": "httpbin.org 0.9.2 [ Base URL: httpbin.org/ ] A simple HTTP Request & Response Service. Run locally: $ docker run -p 80:80 kennethreitz/httpbin the developer - Website Send email to the developer [Powered by Flasgger] Other Utilities HTML form that posts to /post /forms/post"
  }
}
]

According to the attached configuration file, it should be:

"content": "Contenttype: [value of metadata document.contentType]Customer name: Telephone: E-mail ....
"content": "Contenttype: [value of metadata document.contentType]httpbin.org 0.9.2 [ Base URL: httpbin.org/ ] .....

What am I missing? Thanks for your help!

essiembre commented 4 years ago

Taggers cannot modify the content, only metadata. Modifying content is done using Transformers. This works:

<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        var Contenttype = metadata.getString('document.contentType');
        content = 'Contenttype: [' + Contenttype + ']' + content;
    ]]></script>
</transformer>
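
For the original goal (prefixing the content only for HTML documents), the content-type check from the first script can likely be moved into the transformer. The following is a minimal, untested sketch that reuses only the class name and the `metadata` and `content` variables already shown in this thread. The '=> TEST: ' prefix is the placeholder from the question, and the comparison assumes `document.contentType` holds exactly 'text/html':

```xml
<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        // Read the detected content type (assumed to be exactly 'text/html' for HTML pages)
        var contentType = metadata.getString('document.contentType');
        // Only touch HTML documents that actually have content
        if (contentType != null && contentType == 'text/html' && content != null) {
            // Placeholder prefix taken from the original question
            content = '=> TEST: ' + content;
        }
    ]]></script>
</transformer>
```

If the prefix still does not appear, temporarily removing the condition (as in the snippet above it) should show which content type the document actually carries.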

LeMoussel commented 4 years ago

Thank you!