combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

MLeap BinaryLogisticRegressionModel calculating result differs with the Spark model #339

Open ihainan opened 6 years ago

ihainan commented 6 years ago

Hi there. I trained a Spark BinaryLogisticRegressionModel on a dataset in which every label has the same value, then used the model to make predictions.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// data: three rows, all with LABEL = 1
val rddData = sc.parallelize(Seq[(Integer, Double, Double, Double, Double, Double, Double, Double, Double, Double, Double, Double, Double, Double, Double)](
    (1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4),
    (1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.8, 0.4, 0.2, 0.1, 1.2, 1.1, 1.1, 1.0, 0.33),
    (1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.8, 0.4, 0.2, 0.1, 1.2, 1.1, 1.1, 1.0, 0.33)))
val featureCols = Array("C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", "C11", "C12", "C13", "C14")
val data = spark.createDataFrame(rddData).toDF("LABEL" +: featureCols: _*)

// transformers & estimators
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(2)
val lr = new LogisticRegression().setLabelCol("LABEL")

The result from the Spark model looks fine:

{
  "probability":[0.0,1.0], 
  "prediction":1.0
}

After converting to an MLeap model, the "probability" values are all null and the prediction is incorrect as well:

{
  "probability": [null, null],
  "prediction": 0.0
}

Spark Version: 2.1.1
MLeap Version: 0.7.0

It seems that Spark sets the intercept to Double.PositiveInfinity when all labels are identical, and MLeap does not handle this case:

// org.apache.spark.ml.classification.LogisticRegression
val interceptVec = if (isMultinomial) {
  Vectors.sparse(numClasses, Seq((constantLabelIndex, Double.PositiveInfinity)))
} else {
  Vectors.dense(if (numClasses == 2) Double.PositiveInfinity else Double.NegativeInfinity)
}

// ml.combust.mleap.core.classification.BinaryLogisticRegressionModel
def margin(features: Vector): Double = {
  BLAS.dot(features, coefficients) + intercept
}

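
To make the arithmetic concrete, here is a minimal sketch of the margin and sigmoid steps in plain Scala. Everything in it is illustrative: the object and function names are made up, plain arrays stand in for MLeap's Vector and BLAS.dot, and the all-zero coefficients are an assumption. It shows that an infinite intercept by itself still gives a probability of 1.0, whereas an intercept that has turned into NaN (one possible outcome of a lossy round trip, not confirmed in this thread) gives a NaN probability and a prediction of 0.0, which would match the output above.

```scala
object InfiniteInterceptSketch {
  // Plain-array dot product, standing in for BLAS.dot (illustrative only).
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Same shape as the margin function quoted above.
  def margin(features: Array[Double], coefficients: Array[Double], intercept: Double): Double =
    dot(features, coefficients) + intercept

  // Logistic function: math.exp(-Infinity) is 0.0, so sigmoid(+Infinity) is 1.0.
  def sigmoid(m: Double): Double = 1.0 / (1.0 + math.exp(-m))

  // Any comparison with NaN is false, so a NaN probability falls through to 0.0.
  def predict(probability: Double, threshold: Double = 0.5): Double =
    if (probability > threshold) 1.0 else 0.0
}

val features = Array(0.2, 0.2, 0.3)
val coefficients = Array(0.0, 0.0, 0.0) // assumed: zero coefficients for a constant label

val pInf = InfiniteInterceptSketch.sigmoid(
  InfiniteInterceptSketch.margin(features, coefficients, Double.PositiveInfinity))
val pNaN = InfiniteInterceptSketch.sigmoid(
  InfiniteInterceptSketch.margin(features, coefficients, Double.NaN))

println(s"infinite intercept: p=$pInf prediction=${InfiniteInterceptSketch.predict(pInf)}")
println(s"NaN intercept:      p=$pNaN prediction=${InfiniteInterceptSketch.predict(pNaN)}")
```

If this reading is right, the margin computation itself is fine and the value is lost somewhere between the Spark model and the loaded MLeap model.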
hollinwilkins commented 6 years ago

@ihainan Would you be able to have a look at a fix for this and submit a PR? I think we would need to support serializing doubles as positive and negative infinity in the JsonSupport file in the bundle-ml submodule of MLeap.
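
As a rough sketch of what "serializing doubles as positive and negative infinity" could look like: JSON has no number literal for infinity or NaN, so one option is to write those values as tagged strings and map them back on read. The codec below is hypothetical and self-contained; it is not MLeap's JsonSupport, and the string tags "Infinity", "-Infinity" and "NaN" are an arbitrary choice for illustration.

```scala
object NonFiniteDoubleCodec {
  // Encode a double as a JSON token: finite values as numbers,
  // non-finite values as quoted strings (hypothetical convention).
  def encode(d: Double): String =
    if (d.isNaN) "\"NaN\""
    else if (d.isPosInfinity) "\"Infinity\""
    else if (d.isNegInfinity) "\"-Infinity\""
    else d.toString

  // Decode a token produced by encode back into a double.
  def decode(token: String): Double = token match {
    case "\"NaN\""       => Double.NaN
    case "\"Infinity\""  => Double.PositiveInfinity
    case "\"-Infinity\"" => Double.NegativeInfinity
    case other           => other.toDouble
  }
}

// Round trip of the intercept value from this issue.
val encoded = NonFiniteDoubleCodec.encode(Double.PositiveInfinity)
val decoded = NonFiniteDoubleCodec.decode(encoded)
println(s"$encoded -> $decoded")
```

A real fix would have to apply the same convention on both the write and the read side of the bundle format, and stay compatible with bundles that were written before the change.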

marvinxu-free commented 4 years ago

This still seems to be unresolved?