Open · kusumakarb opened this issue 3 years ago
I can see how this would be annoying when trying to make predictions with the MLeap transformer, since the MLeap input schema isn't the same as the Spark input schema.

It is possible to make a PR that would parameterize the StringIndexerModel so that it could work with either numeric or string input: basically add another input arg for the type, apply some case statements in the model, and fix the MLeap transformer + op to handle the new arg. The part I'm not sure about is whether the Spark StringIndexerOp has the necessary information to persist that parameter as part of the bundle; that depends on whether the SparkBundleContext has the DataFrame schema or not.
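For illustration, here is a rough sketch of what that parameterization could look like on the mleap-core side. The `inputBasicType` field and the `TypedStringIndexerModel` class are hypothetical, simplified stand-ins for the existing model, not current MLeap API:

```scala
import ml.combust.mleap.core.types.{BasicType, ScalarType, StructField, StructType}

// Hypothetical, simplified variant of the core model: carry the original input
// type so inputSchema can report it instead of always claiming String.
case class TypedStringIndexerModel(labels: Seq[String],
                                   inputBasicType: BasicType = BasicType.String) {

  private val stringToIndex: Map[String, Double] =
    labels.zipWithIndex.map { case (label, idx) => label -> idx.toDouble }.toMap

  // Mirror Spark's behaviour: numeric inputs are rendered as strings before lookup.
  def apply(value: Any): Double = stringToIndex(value.toString)

  // Report the type the input column actually had at training time.
  def inputSchema: StructType =
    StructType(StructField("input", ScalarType(inputBasicType))).get

  def outputSchema: StructType =
    StructType(StructField("output", ScalarType.Double)).get
}
```

On deserialization, the mleap-side op would read the stored type back and construct the model with it, so the runtime schema would match the original Spark column type.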
I have a dataset with a numerical column `rank` with values ranging from 1 to 5. When this dataset is read in Spark with `inferSchema = true`, the dataType of the column is inferred as `int` by Spark. A model is built by applying `StringIndexer` on the `rank` column as one of the stages, and the `org.apache.spark.ml.PipelineModel` is exported as an MLeap bundle. When we read the `ml.combust.mleap.runtime.frame.Transformer` from the MLeap bundle and observe `transformer.inputSchema`, it returns the dataType of the `rank` column as `String` instead of `int`.
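For concreteness, here is a rough reproduction sketch of the above, following the usual MLeap export/import pattern from the docs (e.g. run in spark-shell with the MLeap Spark jars on the classpath). The bundle path and column names are just examples, and exact imports can differ between MLeap versions:

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import resource._

val spark = SparkSession.builder().master("local[*]").appName("rank-repro").getOrCreate()

// `rank` is an integer column, as it would be after inferSchema = true.
val df = spark.createDataFrame(Seq((1, 3), (2, 5), (3, 1))).toDF("id", "rank")

val pipeline = new Pipeline().setStages(Array(
  new StringIndexer().setInputCol("rank").setOutputCol("rankIndex")))
val model = pipeline.fit(df)
model.transform(df).printSchema()   // Spark still reports rank as integer here

// Export the fitted PipelineModel as an MLeap bundle.
val sbc = SparkBundleContext().withDataset(model.transform(df))
for (bf <- managed(BundleFile("jar:file:/tmp/rank-pipeline.zip"))) {
  model.writeBundle.save(bf)(sbc).get
}

// Load it back with the MLeap runtime and inspect the input schema.
val mleapTransformer = (for (bf <- managed(BundleFile("jar:file:/tmp/rank-pipeline.zip"))) yield {
  bf.loadMleapBundle().get.root
}).opt.get

println(mleapTransformer.inputSchema)   // rank shows up as string, not int
```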
Looks like this is because of https://github.com/combust/mleap/blob/master/mleap-core/src/main/scala/ml/combust/mleap/core/feature/StringIndexerModel.scala#L50
According to the Spark docs for StringIndexer, an input column for a StringIndexer stage can be either a numeric or a string type (numeric inputs are cast to string before indexing).
Is there a way to pass the information about the actual datatype of the column to the transformer, so that `transformer.inputSchema` can return that type instead of `String` in all cases?
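If the SparkBundleContext is built with `withDataset(...)` (as in the export sketch above), the Spark-side op does have access to the training DataFrame's schema, so in principle the real column type could be persisted alongside the labels. Below is a minimal, hypothetical sketch of that write side; the `input_type` attribute name and the `storeInputType` helper are made up for illustration and are not existing MLeap attributes:

```scala
import ml.combust.bundle.dsl.{Model, Value}
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper for the Spark-side StringIndexerOp: given the bundle
// Model being written, the fitted Spark StringIndexerModel, and the training
// DataFrame carried by SparkBundleContext, record the input column's real
// type as an extra attribute so the mleap side could rebuild the right schema.
def storeInputType(model: Model,
                   indexer: StringIndexerModel,
                   dataset: Option[DataFrame]): Model = {
  val inputType = dataset
    .map(_.schema(indexer.getInputCol).dataType.simpleString) // e.g. "int", "string"
    .getOrElse("string")                                      // fall back to today's behaviour
  model.withValue("input_type", Value.string(inputType))
}
```

The corresponding mleap-side op would then read that attribute back during load and map it onto the matching BasicType when constructing the runtime model.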