combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

Fixed an issue with OneHotEncoderOp when computing categorySizes #873

Closed: ltrottier-yelp closed this 4 months ago

ltrottier-yelp commented 4 months ago

The Spark feature org.apache.spark.ml.feature.OneHotEncoderModel has two mixins for the input columns: inputCol and inputCols. We need to check which param is set and use the correct one to compute categorySizes.
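To illustrate the check described above, here is a minimal, dependency-free sketch. It is not the code from this PR: `EncoderParams`, `CategorySizes`, `resolveInputCols`, and the `sizeOf` lookup are hypothetical names that stand in for the Spark params and for however the op derives a column's category count. The both-set case is rejected, in line with the "fails to instantiate" tests listed below.

```scala
// Hypothetical stand-in for the two mutually exclusive input params.
// The real OneHotEncoderModel exposes these through Spark's Param API.
final case class EncoderParams(
  inputCol: Option[String] = None,
  inputCols: Option[Seq[String]] = None
)

object CategorySizes {
  // Pick the input columns from whichever param is set.
  def resolveInputCols(p: EncoderParams): Seq[String] =
    (p.inputCol, p.inputCols) match {
      case (Some(_), Some(_)) =>
        throw new IllegalArgumentException("set only one of inputCol and inputCols")
      case (Some(col), None)  => Seq(col)
      case (None, Some(cols)) => cols
      case (None, None) =>
        throw new IllegalArgumentException("one of inputCol or inputCols must be set")
    }

  // sizeOf is an assumed lookup from column name to its number of categories.
  def categorySizes(p: EncoderParams, sizeOf: String => Int): Seq[Int] =
    resolveInputCols(p).map(sizeOf)
}
```

With a lookup such as `Map("state" -> 3, "city" -> 5)`, a single `inputCol` of `"state"` and an `inputCols` of `Seq("state", "city")` both go through the same code path, which is the point of resolving the param first.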

Tests pass locally:

$ sbt "mleap-spark/testOnly *OneHotEncoderParitySpec*"
[info] OneHotEncoderParitySpec:
[info] - has parity between Spark/MLeap
[info] - serializes/deserializes the Spark model properly
[info] - model input/output schema matches transformer UDF
[info] - serializes/deserializes the Spark model properly with one in/out column
[info] - fails to instantiate if the Spark model sets inputCol and inputCols
[info] - fails to instantiate if the Spark model sets outputCol and outputCols
[info] Run completed in 8 seconds, 315 milliseconds.
[info] Total number of tests run: 6
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

ltrottier-yelp commented 4 months ago

OK, I will add new tests.