Use Java 8 Nashorn engine for Javascript transformers

datacleaner / DataCleaner

The premier open source Data Quality solution

GNU Lesser General Public License v3.0

600 stars 181 forks source link

Use Java 8 Nashorn engine for Javascript transformers #1105

Open kaspersorensen opened 8 years ago

kaspersorensen commented 8 years ago

Now that we've upgraded to use Java 8, we should also be able to use Nashorn instead of the old Rhino JS engine. It should be a lot faster and will allow us to trim our depebdencies too.

LosD commented 8 years ago

Nice. I guess the actual code change is pretty minor?

kaspersorensen commented 8 years ago

I have no clue :-) this just popped up in my head as an optimization we could do. So I wanted to record it for now.

LosD commented 8 years ago

The biggest difference seems to be the access of Java types, so we may have a problem if we need it to be backwards compatible with users' scripts: https://wiki.openjdk.java.net/display/Nashorn/Rhino+Migration+Guide

If not, it seems to be rather simple.

kaspersorensen commented 8 years ago

It should be able to run at least the typical user's scripts in the same way.

Currently (using Rhino) the user can for instance iterate over the "values" array that is provided as a normal javascript array. Furthermore they might have a MAP or LIST type object which uses Java signature. For instance:

var list = values[0];
for (var i=0; i<list.size(); i++) {
  var elem = list.get(i);
}

Notice the use of .size() and .get(i) in that snippet. That's because in this case the script was built to traverse a LIST data type field in DataCleaner.

LosD commented 8 years ago

Yeah, I'm looking at it now.

There are some challenges:

That mixed mixed map and array values monster doesn't seem very easy to change to Nashorn.
Second, precompiled Nashorn is intentionally not thread safe (JavaScript has no concept of threads), so we need to do some locking on the CompiledScript if we don't want to eval() for every row.

LosD commented 8 years ago

By the way, how come the old version uses Rhino directly? It has been integrated into javax.script since Java 6.

kaspersorensen commented 8 years ago

It's not that values is mixed in any way - it's an (Object) array. But a use case that I'd like to keep supporting is that one of the values (for instance at position 0 in my example) would be a Map or a List. So then you should be able to use the Java API for List or Map to traverse or retrieve from that value.

I think also Rhino was not thread safe, so some synchronization is going on in the transformer.

I think it was simply built for Rhino because that seemed to be the easiest at the time and there was never any reason to change that. It might also be that some specifics of the Rhino API is more powerful/precise than the javax API?

LosD commented 8 years ago

Hmmmm... I looked at the code: It uses a NativeArray Rhino class, double the size of the array. All elements are added to it by index and by name. It's pretty nasty, but I can see the convenience.

As far as I can see with a quick glance, Nashorn and Rhino seems to agree pretty much on how Java objects are consumed from JS, it's mostly instantiating them that's really different.

Rhino has a way to get a thread-specific context (ContextFactory.enterContext()), that javax.script doesn't seem to have (by the way, that may very much be the reason Rhino directly without javax.script was chosen), however looking at it, I don't think it's a problem: No JS objects are shared between threads/scopes, only Java objects that in themselves are thread safe (logger and System.out).

LosD commented 8 years ago

I have something working now (only the simple transformer for now) by ignoring the name key.

I think we can make the name key work by either using Nashorn's (internal!) NativeArray, or by making a JavaScript function that does the wrapping. None of them seem very attractive :unamused: