Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

ScriptTagger and Nashorn engine #89

Open danizen opened 5 years ago

danizen commented 5 years ago

I am updating my code to work under OpenJDK 11, as Oracle will soon stop supporting Java 8 and my institution, as a government body may be expected to do, is moving on.

After some adjustments, my tests mostly work, but I see the following warning message, "Warning: Nashorn engine is planned to be removed from a future JDK release", in a verify test that runs with the actual importer configuration I use in production. The problem, I gather, is the ScriptTagger.

JEP 335 states that the Nashorn engine will be removed from a future release. Long term, that's probably a good thing, but the ScriptTagger defaults to using the Nashorn engine, so work should be done to find a better default ECMAScript implementation so that importer configurations similar to the following continue to work:

    <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
      <script><![CDATA[
        /* create a domain field */
        var expr = new RegExp('[a-z]+://([^/]+).*');
        var url = metadata.url[0];
        var domain = url.replace(expr, '$1');
        metadata.addString('domain', domain);

        /* if keywords is not a list, make it one */
        if (metadata.containsKey('keywords')) {
           var keywords = metadata.get('keywords');
           if (typeof keywords == 'string') {
              metadata.set('keywords', [keywords])
           }
        }

        /* Clean the schemaorg_itemtype variables */
        if (metadata.containsKey('schemaorg_itemtype')) {
          var newdata = new java.util.ArrayList();
          var data = metadata.get('schemaorg_itemtype');
          for each (var datum in data) {
            newdata.add(datum.replaceFirst('^https?\://schema.org/', ''));
          }
          metadata.put('schemaorg_itemtype', newdata);
        }
      ]]></script>
    </tagger>

danizen commented 5 years ago

Probably the best thing is to bow to the inevitable and include Groovy, with Groovy as the default script engine implementation.

danizen commented 5 years ago

Another option, although slower: https://search.maven.org/artifact/org.mozilla/rhino/1.7.10/jar

And then there is GraalVM's JS engine, but that looks a little harder to add.

essiembre commented 5 years ago

Given GraalVM comes standard with JDK 11 and supports backward compatibility with Nashorn, it looks like a more than suitable alternative (being a more complete and more efficient implementation).

That being said, this does not prevent us from adding Groovy as an option.
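To make the "default with alternatives" idea concrete, here is a minimal sketch of picking the first available engine from a preference list through the standard `javax.script` API. The class name `EngineFallback`, the preference order, and the engine names `graal.js`, `rhino`, and `groovy` are assumptions for illustration (only `nashorn` is confirmed in this thread); this is not how ScriptTagger is known to behave.

```java
import java.util.List;
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

class EngineFallback {

    // Hypothetical preference order; the engine names other than "nashorn"
    // are assumptions and depend on which JSR-223 jars are on the classpath.
    static final List<String> PREFERRED =
            List.of("graal.js", "nashorn", "rhino", "groovy");

    /** Returns the first engine the manager can resolve, or empty if none is installed. */
    static Optional<ScriptEngine> pick(ScriptEngineManager manager, List<String> names) {
        for (String name : names) {
            // getEngineByName returns null when no factory registers that name.
            ScriptEngine engine = manager.getEngineByName(name);
            if (engine != null) {
                return Optional.of(engine);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<ScriptEngine> engine = pick(new ScriptEngineManager(), PREFERRED);
        System.out.println(engine
                .map(e -> "Using " + e.getFactory().getEngineName())
                .orElse("No script engine found on the classpath"));
    }
}
```

A lookup like this only changes which engine runs the script. Scripts that rely on Nashorn-specific syntax, such as `for each`, may still need changes on another engine.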

danizen commented 5 years ago

Does Graal come standard with OpenJDK 11? That would mean I don't need to go through the jazz in:

https://medium.com/graalvm/graalvms-javascript-engine-on-jdk11-with-high-performance-3e79f968a819

https://github.com/graalvm/graal-js-jdk11-maven-demo/blob/master/pom.xml

danizen commented 5 years ago

The OpenJDK 11 I have, from Azul.com, is a supported version we pay for, because Oracle will not offer support for JDK 11 long enough for us. This version, at least, does not include GraalJS.

Another significant performance issue is that the ScriptEngine cannot compile the code ahead of time and then invoke an Invocable within the evaluation code, as is done in the Maven demo repo linked above.

So, then, compilation is done for each invocation of the ScriptEngine. I guess that is OK.
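For comparison, the standard `javax.script` API defines an optional `Compilable` interface that lets a caller compile a script once and evaluate it many times with fresh bindings. The sketch below assumes the resolved engine implements `Compilable` and falls back to plain `eval` otherwise. The class name `CompileOnce` and the sample URLs are hypothetical, and nothing here is taken from the ScriptTagger source.

```java
import javax.script.Bindings;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class CompileOnce {

    /** Compiles the script once if the engine supports it; returns null otherwise. */
    static CompiledScript compile(ScriptEngine engine, String script) throws ScriptException {
        if (engine instanceof Compilable) {
            return ((Compilable) engine).compile(script);
        }
        return null;
    }

    public static void main(String[] args) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            System.out.println("No JavaScript engine available");
            return;
        }
        // Same domain-extraction expression as in the configuration from this issue.
        String script = "url.replace(new RegExp('[a-z]+://([^/]+).*'), '$1')";
        CompiledScript compiled = compile(engine, script);

        // Hypothetical sample inputs; each evaluation gets its own bindings.
        for (String url : new String[] {"http://a.example/x", "https://b.example/y"}) {
            Bindings bindings = engine.createBindings();
            bindings.put("url", url);
            Object domain = (compiled != null)
                    ? compiled.eval(bindings)        // reuse the compiled form
                    : engine.eval(script, bindings); // engine cannot precompile
            System.out.println(domain);
        }
    }
}
```

Whether precompiling pays off in practice depends on the engine and on how often the same script is evaluated.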

How I checked for Graal on my JDK:

    jshell> import javax.script.ScriptEngine;
    jshell> import javax.script.ScriptEngineFactory;
    jshell> import javax.script.ScriptEngineManager;

    jshell> var manager = new ScriptEngineManager();
    manager ==> javax.script.ScriptEngineManager@48974e45

    jshell> List<ScriptEngineFactory> factories = manager.getEngineFactories()
    factories ==> [jdk.nashorn.api.scripting.NashornScriptEngineFactory@6d3a388c]

    jshell> for (var factory : factories) {
       ...>    println(factory.getEngineName());
       ...>    println(factory.getEngineVersion());
       ...>    println(factory.getLanguageName());
       ...>    println(factory.getLanguageVersion());
       ...>    println(factory.getExtensions());
       ...>    println(factory.getMimeTypes());
       ...>    println(factory.getNames());
       ...>    println("");
       ...> }
    Oracle Nashorn
    11.0.2
    ECMAScript
    ECMA - 262 Edition 5.1
    [js]
    [application/javascript, application/ecmascript, text/javascript, text/ecmascript]
    [nashorn, Nashorn, js, JS, JavaScript, javascript, ECMAScript, ecmascript]
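The same check can be run outside jshell as a small standalone class, which avoids relying on a `println` helper being defined in the session. This is only a sketch: the class name `ListEngines` and its helper methods are made up for illustration, while the `javax.script` calls are the ones used in the session above.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

class ListEngines {

    /** Builds one summary line per script engine factory visible to the manager. */
    static List<String> describe(ScriptEngineManager manager) {
        List<String> lines = new ArrayList<>();
        for (ScriptEngineFactory f : manager.getEngineFactories()) {
            lines.add(f.getEngineName() + " " + f.getEngineVersion()
                    + " | " + f.getLanguageName() + " " + f.getLanguageVersion()
                    + " | names=" + f.getNames());
        }
        return lines;
    }

    /** True if any installed factory registers the given short name. */
    static boolean hasEngineNamed(ScriptEngineManager manager, String name) {
        for (ScriptEngineFactory f : manager.getEngineFactories()) {
            if (f.getNames().contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ScriptEngineManager manager = new ScriptEngineManager();
        describe(manager).forEach(System.out::println);
        System.out.println("nashorn registered: " + hasEngineNamed(manager, "nashorn"));
    }
}
```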