combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0
1.5k stars, 312 forks

Input schema of a StringIndexed Column is always string #761

Open kusumakarb opened 3 years ago

kusumakarb commented 3 years ago

I have a dataset with a numeric column, rank, whose values range from 1 to 5. When the dataset is read in Spark with inferSchema = true, Spark infers the column's dataType as int. I build a model with a StringIndexer on the rank column as one of the pipeline stages and export the resulting org.apache.spark.ml.PipelineModel as an MLeap bundle. When I load the ml.combust.mleap.runtime.frame.Transformer from that bundle and inspect transformer.inputSchema, the rank column's dataType is reported as string instead of int.

It looks like this is caused by https://github.com/combust/mleap/blob/master/mleap-core/src/main/scala/ml/combust/mleap/core/feature/StringIndexerModel.scala#L50

According to the Spark docs for StringIndexer:

If the input column is numeric, we cast it to string and index the string values.

So the input column of a StringIndexer stage can be either a numeric or a string type.
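To make the "cast it to string and index the string values" behaviour concrete, here is a small dependency-free Scala sketch. It is a simplified imitation of the documented behaviour, not Spark's implementation: the object and method names are made up for illustration, and ordering labels by descending frequency with alphabetical tie-breaking is an assumption of the sketch.

```scala
// Simplified imitation of StringIndexer's documented behaviour.
// CastThenIndex, fitLabels and index are hypothetical names, not Spark APIs.
object CastThenIndex {
  // Cast every value to its string form, then order the distinct labels by
  // descending frequency (ties broken alphabetically; an assumption here).
  def fitLabels(values: Seq[Any]): Vector[String] =
    values
      .map(_.toString)
      .groupBy(identity)
      .toVector
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)

  // Look a value up by its string form; None if the label was never seen.
  def index(labels: Vector[String], value: Any): Option[Double] = {
    val i = labels.indexOf(value.toString)
    if (i >= 0) Some(i.toDouble) else None
  }
}
```

With this sketch, fitting on the int values 3, 1, 3, 5, 3, 1 and then indexing either the int 3 or the string "3" gives the same index, which is why the fitted labels alone do not reveal whether the original column was numeric.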

Is there a way to pass the column's actual data type to the transformer, so that transformer.inputSchema returns that type instead of always returning string?

jsleight commented 3 years ago

I can see how this would be annoying when making predictions with the MLeap transformer, since the MLeap input schema doesn't match the Spark input schema.

It is possible to make a PR that parameterizes the StringIndexerModel so that it works with either numeric or string input. That basically means adding another input arg for the type, applying some case statements in the model, and updating the MLeap transformer and op to handle the new arg. The part I'm not sure about is whether the Spark StringIndexerOp has the information needed to persist that parameter as part of the bundle; that depends on whether the SparkBundleContext has the DataFrame schema or not.
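A rough, dependency-free sketch of what such a parameterized model might look like is below. Everything in it is hypothetical: InputKind, TypedStringIndexer and their members do not exist in MLeap, and the real change would also have to touch the transformer, the op, and bundle serialization, none of which is shown here.

```scala
// Hypothetical sketch only; none of these names are part of MLeap.
sealed trait InputKind
case object StringKind extends InputKind
case object IntKind extends InputKind
case object DoubleKind extends InputKind

// A string indexer that remembers the declared type of its input column,
// so an input schema could report that type instead of always string.
final case class TypedStringIndexer(labels: Seq[String],
                                    inputKind: InputKind = StringKind) {
  private val labelToIndex: Map[String, Int] = labels.zipWithIndex.toMap

  // Check the value against the declared kind, cast it to its string
  // form, and look the label up.
  def apply(value: Any): Double = {
    val key = (inputKind, value) match {
      case (StringKind, s: String) => s
      case (IntKind, i: Int)       => i.toString
      case (DoubleKind, d: Double) => d.toString
      case _ =>
        throw new IllegalArgumentException(s"expected $inputKind but got: $value")
    }
    labelToIndex
      .getOrElse(key, throw new NoSuchElementException(s"unseen label: $key"))
      .toDouble
  }
}
```

For example, TypedStringIndexer(Seq("3", "1", "5"), IntKind) would accept the int 3 and reject the string "3", while the default StringKind keeps the current string-only behaviour.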