EclairJS / eclairjs-nashorn

JavaScript API for Apache Spark
Apache License 2.0

Need a way to serialize Spark objects back to eclairjs-node #138

Open doronrosenberg opened 8 years ago

doronrosenberg commented 8 years ago

@billreed63 @bpburns Trying to port over mllib FP Growth, and the issue is that there is a collect() call that returns an array of FreqItemset (which is a nested class). Since we are currently calling JSON.stringify, all we get is an array of values like this:

org.apache.spark.mllib.fpm.FPGrowth$FreqItemset@5026dbbf

What we need is a better serializer, living in Nashorn, that would return us something like this:

{__eclairJSClass: mllib.fpm.FPGrowth$FreqItemset}

And then on the Node.js side we would know to create a wrapper of the correct type.

Any thoughts before I go implement this?
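For illustration, here is a minimal sketch of what such a serializer could look like. The function name `ourStringify` is taken from the snippet later in this thread; everything else (the `getClass()` check used to detect Java objects reaching Nashorn) is an assumption, not the actual eclairjs-nashorn implementation.

```javascript
// Hypothetical sketch: walk the value, replacing anything that looks like
// a Java object with a class marker the Node.js side can dispatch on.
function ourStringify(value) {
    function tag(v) {
        if (Array.isArray(v)) {
            return v.map(tag);
        }
        // Assumption: Java objects exposed to Nashorn have a getClass() method
        if (v !== null && typeof v === 'object' && typeof v.getClass === 'function') {
            return { __eclairJSClass: v.getClass().getName() };
        }
        return v; // primitives and plain JS values serialize as-is
    }
    return JSON.stringify(tag(value));
}
```

So an array of FreqItemsets would serialize to an array of `__eclairJSClass` markers instead of the opaque `toString()` output shown above.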

billreed63 commented 8 years ago

How would that help get the values of the FreqItemset back to node? Don't we need to add a toJSON method to the FreqItemset to convert the object contents to JSON?

doronrosenberg commented 8 years ago

@billreed63 My thought was to do this in a general way that doesn't need to do JSON conversion. Basically the Node.js code would generate:

var collection = rdd.collect();
ourStringify(collection);

which would return:

[
{__eclairJSClass: mllib.fpm.FPGrowth$FreqItemset},
...
]

We can then create a FreqItemset object in Node that references collection[0], collection[1], etc. These would be remote proxies to the collection variable that holds all our FreqItemsets.
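A rough sketch of what that proxy creation could look like on the Node.js side (the function names `wrapCollection` and `executeRemote` are illustrative, not the real eclairjs-node API):

```javascript
// Hypothetical sketch: turn the markers returned by ourStringify into
// remote proxies. Each proxy method forwards a call to the remote
// `collection[i]` expression that still lives in Nashorn.
function wrapCollection(markers, remoteVarName, executeRemote) {
    return markers.map(function (marker, i) {
        var ref = remoteVarName + '[' + i + ']';
        if (marker.__eclairJSClass === 'org.apache.spark.mllib.fpm.FPGrowth$FreqItemset') {
            return {
                // each call round-trips to Nashorn against the remote element
                freq: function () { return executeRemote(ref + '.freq()'); },
                items: function () { return executeRemote(ref + '.items()'); }
            };
        }
        return marker; // unknown class: leave the marker as-is
    });
}
```

The trade-off is exactly the one raised below: every method call on a proxy is a round trip to Nashorn.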

billreed63 commented 8 years ago

We would need to make a request across for each item. Can we just add:

FreqItemset.prototype.toJSON = function() {
    var json = {};
    json.freq = this.freq();
    json.items = this.items();

    return json;
};

and

List.prototype.toJSON = function() {
    return this.getJavaObject().toString();
};

then

  var x = JSON.stringify(result);

gives us:

[{"freq":3,"items":"[t]"},{"freq":3,"items":"[t, x]"},{"freq":3,"items":"[t, x, z]"},{"freq":3,"items":"[t, z]"},{"freq":3,"items":"[s]"},{"freq":2,"items":"[s, t]"},{"freq":2,"items":"[s, t, x]"},{"freq":2,"items":"[s, t, x, z]"},{"freq":2,"items":"[s, t, z]"},{"freq":3,"items":"[s, x]"},{"freq":2,"items":"[s, x, z]"},{"freq":2,"items":"[s, z]"},{"freq":2,"items":"[p]"},{"freq":2,"items":"[p, r]"},{"freq":2,"items":"[p, r, z]"},{"freq":2,"items":"[p, z]"},{"freq":5,"items":"[z]"},{"freq":3,"items":"[y]"},{"freq":3,"items":"[y, t]"},{"freq":3,"items":"[y, t, x]"},{"freq":3,"items":"[y, t, x, z]"},{"freq":3,"items":"[y, t, z]"},{"freq":2,"items":"[y, s]"},{"freq":2,"items":"[y, s, t]"},{"freq":2,"items":"[y, s, t, x]"},{"freq":2,"items":"[y, s, t, x, z]"},{"freq":2,"items":"[y, s, t, z]"},{"freq":2,"items":"[y, s, x]"},{"freq":2,"items":"[y, s, x, z]"},{"freq":2,"items":"[y, s, z]"},{"freq":3,"items":"[y, x]"},{"freq":3,"items":"[y, x, z]"},{"freq":3,"items":"[y, z]"},{"freq":2,"items":"[q]"},{"freq":2,"items":"[q, t]"},{"freq":2,"items":"[q, t, x]"},{"freq":2,"items":"[q, t, x, z]"},{"freq":2,"items":"[q, t, z]"},{"freq":2,"items":"[q, y]"},{"freq":2,"items":"[q, y, t]"},{"freq":2,"items":"[q, y, t, x]"},{"freq":2,"items":"[q, y, t, x, z]"},{"freq":2,"items":"[q, y, t, z]"},{"freq":2,"items":"[q, y, x]"},{"freq":2,"items":"[q, y, x, z]"},{"freq":2,"items":"[q, y, z]"},{"freq":2,"items":"[q, x]"},{"freq":2,"items":"[q, x, z]"},{"freq":2,"items":"[q, z]"},{"freq":4,"items":"[x]"},{"freq":3,"items":"[x, z]"},{"freq":3,"items":"[r]"},{"freq":2,"items":"[r, x]"},{"freq":2,"items":"[r, z]"}]
doronrosenberg commented 8 years ago

For FreqItemset that would work, but what if it had a method that did some calculations? I was thinking more of a general solution. Indeed we would have to make a request across for each item.

I guess for collect()/take() it might be enough to just have toJSON(), but we would have to implement that everywhere.
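If per-class toJSON methods like the ones above became the pattern, one way to cut down the boilerplate would be a small registration helper. This is a hypothetical sketch (the `registerToJSON` name and its shape are illustrative, not part of eclairjs):

```javascript
// Hypothetical helper: instead of hand-writing toJSON on every wrapper,
// register the getter names once per class and generate toJSON from them.
function registerToJSON(ctor, getterNames) {
    ctor.prototype.toJSON = function () {
        var self = this;
        var json = {};
        getterNames.forEach(function (name) {
            json[name] = self[name](); // call each listed getter
        });
        return json;
    };
}

// e.g. registerToJSON(FreqItemset, ['freq', 'items']);
```

JSON.stringify picks up the generated toJSON automatically, so collect()/take() results would serialize the same way as the hand-written version.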

billreed63 commented 8 years ago

Do we really have a need for a remote invocation model? I don't see Spark as an interactive thing; it seems more like a batch thing.

doronrosenberg commented 8 years ago

We already have to do this for randomSplit that returns an array of RDDs. For collect/take though I think returning a JSON blob could be enough.