combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

CountVectorizer needs explicit conversion between Java and Scala #508

Closed FredYao closed 4 years ago

FredYao commented 5 years ago

I am training a CountVectorizerModel in Spark that takes an array of strings as its input feature. I then serialized the model using MLeap, deserialized it in a pure Java environment, and used it as a transformer. However, the transformer (in Java) did not recognize Java's String array. According to the reported error, I have to explicitly convert the Java string array to a Scala string list. I think this makes things a little subtle: people have to do the conversion explicitly before they feed input into an MLeap model, and without printing the MLeap model schema, it's hard to know what the model actually accepts. If training uses an array of strings as input, people would intuitively also pass an array of strings when scoring. Is it possible to have MLeap handle the Java-Scala conversion implicitly?

ancasarb commented 5 years ago

Hey @FredYao, to avoid having to convert the Java array to Scala, you could wrap your CountVectorizer in a Spark pipeline and serialize the model as a pipeline with a single transformer. Then, in Java, you can score using the pipeline model, which scores using a leap frame.
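As a rough illustration of the Spark side of this suggestion, a minimal Scala sketch might look like the following. The column names (`id`, `words`, `features`), the toy data, and the object name are assumptions for illustration and do not come from the original report. Serializing the fitted pipeline with MLeap is left out here.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of MLeap or of the original issue.
object CountVectorizerPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cv-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training data: the "words" column holds an array of strings.
    val df = Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    ).toDF("id", "words")

    val cv = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")

    // Wrap the single estimator in a Pipeline, so that the fitted result is a
    // PipelineModel rather than a bare CountVectorizerModel.
    val pipeline = new Pipeline().setStages(Array(cv))
    val model: PipelineModel = pipeline.fit(df)

    // Sanity check: the pipeline model produces the same kind of output
    // as the standalone CountVectorizerModel would.
    model.transform(df).show(truncate = false)

    spark.stop()
  }
}
```

The fitted `PipelineModel` is then the object to serialize, instead of the standalone `CountVectorizerModel`.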

We've added quite a bit of support for working with leap frames in Java; you can take a look at the tests here: https://github.com/combust/mleap/blob/master/mleap-runtime/src/test/scala/ml/combust/mleap/runtime/javadsl/JavaDSLSpec.java

https://github.com/combust/mleap/blob/master/mleap-runtime/src/test/scala/ml/combust/mleap/runtime/javadsl/JavaDSLSpec.java#L48 shows you how to create a leap frame. You'll notice that when creating the row, you can pass a Java array without any issues: https://github.com/combust/mleap/blob/master/mleap-runtime/src/test/scala/ml/combust/mleap/runtime/javadsl/JavaDSLSpec.java#L36

Hope this helps, Anca

ancasarb commented 4 years ago

Closing this in preparation for release 0.16.0. Hopefully the clarification above makes sense; please re-open the issue if not. Thank you!