combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

OneVsRest issue with MLeap 0.14.0 and 0.15.0 #603

Open hdamani09 opened 4 years ago

hdamani09 commented 4 years ago

Scala version: 2.11
Spark version: 2.4.3

Hi, I have a pipeline built from the following Spark ML transformers and algorithms. I am trying to solve a multiclass classification problem using LinearSVC and OneVsRest.

    import org.apache.spark.ml.Pipeline

    val stringIndexer = new org.apache.spark.ml.feature.StringIndexer().setInputCol("label").setOutputCol("label_indexed").setHandleInvalid("skip")
    val tokenizer = new org.apache.spark.ml.feature.Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new org.apache.spark.ml.feature.HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lsvc = new org.apache.spark.ml.classification.LinearSVC().setLabelCol("label_indexed").setFeaturesCol("features")
    // OneVsRest from the MLeap Spark extension (used in the pipeline below)
    val mleapOVR = new org.apache.spark.ml.mleap.classification.OneVsRest().setClassifier(lsvc).setLabelCol("label_indexed").setFeaturesCol("features")
    // OneVsRest from Spark ML itself (not used in the pipeline below)
    val sparkmlOVR = new org.apache.spark.ml.classification.OneVsRest().setClassifier(lsvc).setLabelCol("label_indexed").setFeaturesCol("features")
    val indexToString = new org.apache.spark.ml.feature.IndexToString().setInputCol("prediction").setOutputCol("predictedCategory").setLabels(stringIndexer.fit(df).labels)

    val model = new Pipeline().setStages(Array(stringIndexer, tokenizer, hashingTF, mleapOVR, indexToString)).fit(df)
    val predictionDf = model.transform(df)
    predictionDf.show()

+---+--------------------+------+-------------+--------------------+--------------------+------------------------------------------+----------+------------------+-----------------+
| id|                text| label|label_indexed|               words|            features|mbc$lpc539bff7-6d99-4077-bfb1-ce2fea8b632b|prediction|       probability|predictedCategory|
+---+--------------------+------+-------------+--------------------+--------------------+------------------------------------------+----------+------------------+-----------------+
|  0|a table-tennis an...|sports|          0.0|[a, table-tennis,...|(1000,[170,273,33...|                      [0, 1.45738526075...|         0|  1.45738526075102|           sports|
|  1|            pool c d|sports|          0.0|        [pool, c, d]|(1000,[94,722,860...|                      [0, 2.06780485158...|         0|2.0678048515897625|           sports|
|  2|  h cricket f hockey|sports|          0.0|[h, cricket, f, h...|(1000,[2,220,248,...|                      [0, 2.06780485158...|         0|2.0678048515897625|           sports|
|  3|  alien i u predator|movies|          1.0|[alien, i, u, pre...|(1000,[329,352,52...|                      [1, 1.44052588695...|         1|1.4405258869594206|           movies|
|  4|  terminator is back|movies|          1.0|[terminator, is, ...|(1000,[182,281,43...|                      [1, 2.34671134516...|         1|2.3467113451695374|           movies|
|  5|       v r gladiator|movies|          1.0|   [v, r, gladiator]|(1000,[248,477,57...|                      [1, 1.53089091952...|         1|1.5308909195231362|           movies|
|  6|hi badminton hey ...|sports|          0.0|[hi, badminton, h...|(1000,[170,559,92...|                      [0, 1.63159124054...|         0|1.6315912405409212|           sports|
|  7|        rambo is lit|movies|          1.0|    [rambo, is, lit]|(1000,[281,661,81...|                      [1, 2.34671134516...|         1|2.3467113451695374|           movies|
|  8| you chess me hockey|sports|          0.0|[you, chess, me, ...|(1000,[2,425,471,...|                      [0, 1.34536377753...|         0|1.3453637775364258|           sports|
|  9|    we play checkers|sports|          0.0|[we, play, checkers]|(1000,[173,902,99...|                      [0, 2.17738374250...|         0|2.1773837425059606|           sports|
| 10|       be like rocky|movies|          1.0|   [be, like, rocky]|(1000,[330,656,74...|                      [1, 2.27254585192...|         1| 2.272545851928956|           movies|
| 11|           i m joker|movies|          1.0|       [i, m, joker]|(1000,[36,329,638...|                      [1, 2.16296696101...|         1|2.1629669610127578|           movies|
+---+--------------------+------+-------------+--------------------+--------------------+------------------------------------------+----------+------------------+-----------------+

I need to serialize the PipelineModel to deploy it on SageMaker. Serialization does not work if I use the sparkmlOVR instance in the pipeline, so I switched to the mleapOVR instance and used the following code to serialize the model to a zip:

    val simpleSparkSerializerObj = new SimpleSparkSerializer
    simpleSparkSerializerObj.serializeToBundleWithFormat(model, "jar:file:/C:/tmp/ovr.zip", predictionDf, SerializationFormat.Json)

Issue with 0.14.0 while serializing: the zip it creates does not contain the model.json artifacts for OneVsRest, and that directory is empty. So if I try to deserialize it using the following code, it throws an exception:

val bundle = (for (bundleFile <- managed(BundleFile("jar:file:/C:/tmp/ovr.zip"))) yield { bundleFile.loadSparkBundle().get }).opt.get.root

Exception in thread "main" java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
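
The trace above shows only None.get, which suggests the real load error was discarded when the result was converted to an Option before calling .get. The sketch below illustrates that effect in plain Scala and how keeping a Try preserves the cause. It does not use the MLeap or scala-arm APIs: loadBundle and its error message are hypothetical stand-ins for a load that fails.

```scala
import scala.util.{Failure, Success, Try}

object SurfaceLoadError {
  // Hypothetical stand-in for loading a bundle; it always fails here
  // to mimic a zip whose OneVsRest directory is empty.
  def loadBundle(path: String): Try[String] =
    Try(throw new java.io.FileNotFoundException(s"model.json missing in $path"))

  def main(args: Array[String]): Unit = {
    // Converting to Option drops the exception; calling .get on it would
    // then fail with the uninformative NoSuchElementException: None.get.
    val asOption: Option[String] = loadBundle("ovr.zip").toOption
    println(s"as Option: $asOption")

    // Keeping the Try makes the underlying cause available.
    loadBundle("ovr.zip") match {
      case Success(bundle) => println(s"loaded $bundle")
      case Failure(cause)  => println(s"load failed: ${cause.getMessage}")
    }
  }
}
```

Inspecting the failure this way, rather than unwrapping an Option, should show what actually went wrong while reading the zip.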

If I use MLeap 0.15.0 to serialize the pipeline and deserialize the zip back to a transformer, the above exception is not thrown. But when I use the deserialized transformer to transform df, it throws the following exception:

bundle.transform(df).show()
Exception in thread "main" java.util.NoSuchElementException: Failed to find a default value for classifier
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params$class.$(params.scala:786)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.mleap.classification.OneVsRestParams$class.getClassifier(OneVsRest.scala:70)
    at org.apache.spark.ml.mleap.classification.OneVsRestModel.getClassifier(OneVsRest.scala:137)
    at org.apache.spark.ml.mleap.classification.OneVsRestModel.transformSchema(OneVsRest.scala:154)
    at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
    at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)

Is there an issue with the OneVsRest provided by the MLeap Spark extension 0.15.0 for Spark 2.4.3?

A quick response and resolution would be much appreciated. Thank you

WeichenXu123 commented 4 years ago

This happens because MLeap does not serialize the classifier param. During model transform it needs to check classifier.featuresDataType, so it reads the classifier param, finds no value, and raises the error.

We can infer the classifier class name from the saved classifier model objects, then use Java reflection to create the classifier object and set it as the classifier param. @ancasarb Could you help create a PR for it? I am too busy and may not have time to create one.
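
A minimal sketch of the reflection step described above, not MLeap's actual fix. It assumes a naming convention ("XModel" maps to "X", and "XClassificationModel" maps to "XClassifier") that holds for LinearSVCModel but is not guaranteed for every Spark classifier. It also assumes the classifier has a public no-argument constructor. Spark is not on the classpath here, so java.util.ArrayList stands in for the class being instantiated, and the helper names are hypothetical.

```scala
object ClassifierFromModel {
  // Assumed convention for deriving the estimator class name from the model
  // class name. Verify it for each classifier before relying on it.
  def estimatorClassName(modelClassName: String): String =
    if (modelClassName.endsWith("ClassificationModel"))
      modelClassName.stripSuffix("ClassificationModel") + "Classifier"
    else
      modelClassName.stripSuffix("Model")

  // Create an instance by class name through its no-argument constructor.
  def instantiate(className: String): AnyRef =
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]

  def main(args: Array[String]): Unit = {
    // String mapping only; no Spark classes are loaded in this sketch.
    println(estimatorClassName("org.apache.spark.ml.classification.LinearSVCModel"))

    // A JDK class stands in for the Spark classifier class.
    val instance = instantiate("java.util.ArrayList")
    println(instance.getClass.getName)
  }
}
```

In the real fix, the instantiated classifier would then be set as the classifier param on the deserialized OneVsRestModel so that transformSchema can read it.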

ancasarb commented 4 years ago

Sure thing, I'll take a look!

WeichenXu123 commented 4 years ago

@ancasarb Do you plan to create a PR for this? I will help review :)