combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

MleapSpringBoot - Multi-input StringIndexer not supported yet #784

Open inardini opened 3 years ago

inardini commented 3 years ago

To whom it may concern,

I'm trying to deploy a PySpark pipeline using an MLeap bundle with the combustml/mleap-spring-boot:0.19.0-SNAPSHOT Docker image, and I get this error:

[MleapSpringBoot-akka.actor.default-dispatcher-6] [akka://MleapSpringBoot/user/transform/model] 
Cannot load bundle because: java.lang.UnsupportedOperationException: Multi-input StringIndexer not supported yet.

Any insight into how I can fix it?

The bundle has the following structure:

model
├── bundle.json
└── root
    ├── RandomForestClassifier_e24b4862ceb2.node
    │   ├── model.json
    │   ├── node.json
    │   ├── tree0
    │   .......
    │   └── tree9
    │       ├── model.json
    │       └── tree.json
    ├── StandardScaler_a24a7bb9bb7b.node
    │   ├── model.json
    │   └── node.json
    ├── StringIndexer_07ad6a29446e.node
    │   ├── model.json
    │   └── node.json
    ├── StringIndexer_397d06fcffaa.node
    │   ├── model.json
    │   └── node.json
    ├── VectorAssembler_56af20ae6ed6.node
    │   ├── model.json
    │   └── node.json
    ├── VectorAssembler_c118350511db.node
    │   ├── model.json
    │   └── node.json
    ├── model.json
    └── node.json

It was trained using ml.combust.mleap:mleap-runtime_2.12:0.18.1 and ml.combust.mleap:mleap-spark_2.12:0.18.1 with Spark 3.1.2.

Thanks

jsleight commented 3 years ago

The error means you are using StringIndexer with the multi-column input/output format, i.e., you set the inputCols parameter (and maybe the outputCols parameter). This is a new feature added in Spark 3. MLeap does support Spark 3, but it doesn't yet support 100% of Spark 3's capabilities (we try to throw exceptions like this when support isn't available yet).
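To find which stages trigger this before exporting the bundle, something like the following might help. It is only a sketch: the helper name multi_column_string_indexers is made up for illustration, and it assumes each stage exposes PySpark's usual isSet() method and an inputCols param, as StringIndexer does in Spark 3. It matches on the class name so that stages such as VectorAssembler, which also take inputCols, are not flagged.

```python
def multi_column_string_indexers(stages):
    """Return the StringIndexer-like stages whose `inputCols` param is explicitly set.

    Hypothetical helper: `stages` is any iterable of pipeline stages, e.g.
    `pipeline.getStages()` or `pipeline_model.stages` in PySpark.
    """
    flagged = []
    for stage in stages:
        # Covers both StringIndexer (estimator) and StringIndexerModel (fitted).
        if not type(stage).__name__.startswith("StringIndexer"):
            continue
        # Assumption: the stage follows the PySpark Params interface.
        if hasattr(stage, "inputCols") and stage.isSet(stage.inputCols):
            flagged.append(stage)
    return flagged
```

Any stage this returns is a candidate for the rewrite described below.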

As a workaround, you can replace your multi-column StringIndexer with multiple single-column StringIndexers. E.g., supposing you have code like this right now:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCols=["foo", "bar", "baz"], outputCols=["a", "b", "c"])
pipe = Pipeline(stages=[..., indexer, ...])

Then change it to:

indexer1 = StringIndexer(inputCol="foo", outputCol="a")
indexer2 = StringIndexer(inputCol="bar", outputCol="b")
indexer3 = StringIndexer(inputCol="baz", outputCol="c")
pipe = Pipeline(stages=[..., indexer1, indexer2, indexer3, ...])

This will be functionally equivalent and is supported in MLeap.
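If you have many columns, you could generate the single-column stages instead of writing them by hand. A minimal sketch, assuming the stage constructor accepts inputCol and outputCol keyword arguments (as PySpark's StringIndexer does); the helper name single_column_stages is hypothetical:

```python
def single_column_stages(input_cols, output_cols, make_stage):
    """Build one single-column stage per (input, output) column pair.

    `make_stage` is any callable taking `inputCol` and `outputCol` keyword
    arguments, e.g. `pyspark.ml.feature.StringIndexer` (assumption).
    """
    if len(input_cols) != len(output_cols):
        raise ValueError("input_cols and output_cols must have the same length")
    return [
        make_stage(inputCol=in_col, outputCol=out_col)
        for in_col, out_col in zip(input_cols, output_cols)
    ]


# Usage with PySpark (needs an active SparkSession, so shown as a comment):
# from pyspark.ml.feature import StringIndexer
# indexers = single_column_stages(["foo", "bar", "baz"], ["a", "b", "c"], StringIndexer)
# pipe = Pipeline(stages=[..., *indexers, ...])
```

Passing the constructor in as an argument keeps the helper independent of Spark, so the same approach should work for other stages that have both single-column and multi-column forms.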