combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

MLeap transform does not support null values. #597

Open psc0606 opened 4 years ago

psc0606 commented 4 years ago

I get the error below when I use an MLeap bundle model and one of the input values is missing (null). MLeap cannot handle this case. Can anyone help? My MLeap version: 0.13.0

scala.MatchError: null
    at ml.combust.mleap.core.feature.VectorAssemblerModel$$anonfun$apply$3.apply(VectorAssemblerModel.scala:37)
    at ml.combust.mleap.core.feature.VectorAssemblerModel$$anonfun$apply$3.apply(VectorAssemblerModel.scala:37)
    at scala.collection.immutable.Stream.foreach(Stream.scala:594)
    at ml.combust.mleap.core.feature.VectorAssemblerModel.apply(VectorAssemblerModel.scala:37)
    at ml.combust.mleap.runtime.transformer.feature.VectorAssembler$$anonfun$1.apply(VectorAssembler.scala:18)
    at ml.combust.mleap.runtime.transformer.feature.VectorAssembler$$anonfun$1.apply(VectorAssembler.scala:18)
    at ml.combust.mleap.runtime.frame.Row$class.udfValue(Row.scala:241)
    at ml.combust.mleap.runtime.frame.ArrayRow.udfValue(ArrayRow.scala:17)
    at ml.combust.mleap.runtime.frame.Row$class.withValue(Row.scala:221)
    at ml.combust.mleap.runtime.frame.ArrayRow.withValue(ArrayRow.scala:17)
    at ml.combust.mleap.runtime.frame.DefaultLeapFrame$$anonfun$withColumn$1$$anonfun$apply$2$$anonfun$2.apply(DefaultLeapFrame.scala:54)
    at ml.combust.mleap.runtime.frame.DefaultLeapFrame$$anonfun$withColumn$1$$anonfun$apply$2$$anonfun$2.apply(DefaultLeapFrame.scala:54)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
    at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:1120)
    at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:1120)
    at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:1109)
    at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:1109)
    at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:1114)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
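
For readers unfamiliar with this exception: `scala.MatchError: null` is what Scala throws when a value reaches a pattern match that has no case for it, and type patterns never match `null`. The sketch below reproduces that behavior in plain Scala. `MatchErrorSketch` and `assemble` are hypothetical names chosen for illustration; this is not the MLeap source, only a guess at the general shape of the failure.

```scala
// Hypothetical stand-in for an assembler that flattens doubles and double
// arrays into one feature array. This is NOT the MLeap implementation.
object MatchErrorSketch {
  def assemble(values: Seq[Any]): Array[Double] =
    values.flatMap {
      case d: Double        => Seq(d)
      case a: Array[Double] => a.toSeq
      // There is no `case null`, and type patterns never match null,
      // so a null input falls through and throws scala.MatchError.
    }.toArray

  def main(args: Array[String]): Unit = {
    // Non-null inputs are flattened as expected.
    println(assemble(Seq(1.0, Array(2.0, 3.0))).mkString(","))

    // A null input fails with scala.MatchError.
    val failure = scala.util.Try(assemble(Seq(1.0, null)))
    println(failure.isFailure)
  }
}
```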

psc0606 commented 4 years ago

@hollinwilkins

ancasarb commented 4 years ago

@psc0606 I tried passing a null value to a VectorAssembler in plain Spark, without MLeap, and that is not supported either, so this looks like expected behavior?

Here's the small example I tried:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DoubleType, StructType}

import scala.util.Random

// Assumes an existing SparkSession named `spark` (spark-shell or a notebook).
// The second column is deliberately null.
def randomRow(): Row = Row(Random.nextDouble(), null)

val rows = spark.sparkContext.parallelize(Seq.tabulate(1) { _ => randomRow() })
val schema = new StructType()
    .add("real", DoubleType, nullable = false)
    .add("another_real", DoubleType, nullable = true)

val dataset: DataFrame = spark.sqlContext.createDataFrame(rows, schema)

val sparkTransformer: Transformer = new VectorAssembler().
    setInputCols(Array("real", "another_real")).
    setOutputCol("features")

// Fails once the rows are materialized, because "another_real" is null.
// `display` is a notebook helper; outside a notebook, call `.show()` instead.
display(sparkTransformer.transform(dataset))

To get this to work, you'd need to use an Imputer or some similar transformer to impute the null values first.
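
A minimal sketch of the imputation approach on the Spark side, assuming Spark 2.2 or later (where `org.apache.spark.ml.feature.Imputer` is available) and a local session. The object name, the `another_real_imputed` column, and the sample values are made up for illustration. Whether each stage also serializes to an MLeap bundle is not shown here and should be checked against the MLeap version in use.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Imputer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ImputeThenAssemble {
  // Fits an Imputer + VectorAssembler pipeline on a tiny dataset that
  // contains one null, and returns the transformed DataFrame.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Option[Double] yields a nullable DoubleType column; None becomes null.
    val dataset = Seq[(Double, Option[Double])](
      (1.0, Some(2.0)),
      (3.0, None),
      (5.0, Some(4.0))
    ).toDF("real", "another_real")

    // Replace missing values in "another_real" with the column mean.
    val imputer = new Imputer()
      .setInputCols(Array("another_real"))
      .setOutputCols(Array("another_real_imputed"))
      .setStrategy("mean")

    // Assemble only columns that can no longer contain nulls.
    val assembler = new VectorAssembler()
      .setInputCols(Array("real", "another_real_imputed"))
      .setOutputCol("features")

    val model = new Pipeline().setStages(Array(imputer, assembler)).fit(dataset)
    model.transform(dataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("impute-then-assemble-sketch")
      .getOrCreate()
    run(spark).show(truncate = false)
    spark.stop()
  }
}
```

The assembler reads the imputed column rather than the raw one, so the null never reaches it.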

Do you have an example where Spark works and MLeap doesn't that I could take a look at?

bkusumakar commented 3 years ago

@ancasarb Creating the VectorAssembler as shown below makes it handle null values in Spark:

      new VectorAssembler()
        .setHandleInvalid("keep")
        .setInputCols(Array("real", "another_real"))
        .setOutputCol("features")

But the corresponding MLeap transformer still fails on null values with scala.MatchError: null.
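
One possible stopgap, sketched under assumptions: Spark documents `handleInvalid = "keep"` as emitting NaN in place of invalid values, so replacing nulls with `Double.NaN` in the raw input values before they reach the MLeap transformer roughly mimics that behavior for scalar numeric columns. `NullToNaN` and `fillNulls` are hypothetical helpers, not part of MLeap. Null vector columns would need a known size and are not covered, and whether the downstream model tolerates NaN depends on the model.

```scala
// Hypothetical pre-processing helper; not part of the MLeap API.
object NullToNaN {
  // Replace every null in a row of raw input values with NaN and
  // leave all other values untouched.
  def fillNulls(values: Seq[Any]): Seq[Any] =
    values.map {
      case null  => Double.NaN
      case other => other
    }

  def main(args: Array[String]): Unit = {
    val cleaned = fillNulls(Seq(0.42, null))
    // The second value is now NaN instead of null.
    println(cleaned(1).asInstanceOf[Double].isNaN)
  }
}
```

The cleaned values can then be used to build the input row for the model in place of the original ones.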