combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

key not found: org.apache.spark.ml.feature.ImputerModel #354

Open caesarjuly opened 6 years ago

caesarjuly commented 6 years ago

According to the docs, the Imputer is supported, but I get the error below when trying to save the bundle file. Here are my dependency versions:

Spark 2.2.0
"ml.combust.mleap" %% "mleap-runtime" % "0.9.5"
"ml.combust.mleap" %% "mleap-spark" % "0.9.5"

I don't know what to do. Can you help me, please?
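For reference, those dependency declarations correspond to roughly the following sbt snippet (build.sbt layout assumed; versions exactly as reported above):

    // build.sbt (sketch) -- versions as reported in this issue; Spark itself is 2.2.0
    libraryDependencies ++= Seq(
      "ml.combust.mleap" %% "mleap-runtime" % "0.9.5",
      "ml.combust.mleap" %% "mleap-spark"   % "0.9.5"
    )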

Exception in thread "main" java.util.NoSuchElementException: key not found: org.apache.spark.ml.feature.ImputerModel
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at ml.combust.bundle.BundleRegistry.opForObj(BundleRegistry.scala:84)
    at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:31)
    at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:30)
    at scala.util.Try$.apply(Try.scala:192)
    at ml.combust.bundle.serializer.GraphSerializer.writeNode(GraphSerializer.scala:30)
    at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
    at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
    at ml.combust.bundle.serializer.GraphSerializer.write(GraphSerializer.scala:20)
    at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:21)
    at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:14)
    at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:87)
    at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:83)
    at scala.util.Try$.apply(Try.scala:192)
    at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
    at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:85)
    at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:81)
    at scala.util.Try$.apply(Try.scala:192)
    at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
    at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:34)
    at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:29)
    at scala.util.Try$.apply(Try.scala:192)
    at ml.combust.bundle.serializer.BundleSerializer.write(BundleSerializer.scala:29)
    at ml.combust.bundle.BundleWriter.save(BundleWriter.scala:26)
    at com.zhihu.saturn.offline.process.CalculateLRModel$$anonfun$training$2.apply(CalculateLRModel.scala:161)
    at com.zhihu.saturn.offline.process.CalculateLRModel$$anonfun$training$2.apply(CalculateLRModel.scala:160)
    at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:88)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
    at scala.util.control.Exception$Catch.apply(Exception.scala:103)
    at scala.util.control.Exception$Catch.either(Exception.scala:125)
    at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88)
    at resource.ManagedResourceOperations$class.apply(ManagedResourceOperations.scala:26)
    at resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50)
    at resource.ManagedResourceOperations$class.acquireAndGet(ManagedResourceOperations.scala:25)
    at resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:50)
    at resource.ManagedResourceOperations$class.foreach(ManagedResourceOperations.scala:53)
    at resource.AbstractManagedResource.foreach(AbstractManagedResource.scala:50)
    at com.zhihu.saturn.offline.process.CalculateLRModel$.training(CalculateLRModel.scala:160)
    at com.zhihu.saturn.offline.process.CalculateLRModel$.run(CalculateLRModel.scala:51)
    at com.zhihu.saturn.offline.process.CalculateLRModel$.main(CalculateLRModel.scala:225)
    at com.zhihu.saturn.offline.process.CalculateLRModel.main(CalculateLRModel.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

ancasarb commented 6 years ago

@caesarjuly Try adding "ml.combust.mleap" %% "mleap-spark-extension" % "0.9.5" as a dependency and use

import org.apache.spark.ml.mleap.feature.Imputer

from there instead.

I believe the out-of-the-box Spark transformer can work on multiple columns, which isn't supported in MLeap at the moment. The transformer from mleap-spark-extension behaves the same as Spark's, with the additional restriction that it operates on just a single column.
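For illustration, here is a minimal sketch of that workaround. It assumes the 0.9.x bundle-export API shown in the MLeap docs (BundleFile, SparkSupport, SparkBundleContext, scala-arm's managed), that the extension Imputer exposes single-column setInputCol/setOutputCol/setStrategy setters, and placeholder column names and paths; treat it as a starting point, not tested code:

    import ml.combust.bundle.BundleFile
    import ml.combust.mleap.spark.SparkSupport._
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.bundle.SparkBundleContext
    import org.apache.spark.ml.mleap.feature.Imputer // extension Imputer, not org.apache.spark.ml.feature.Imputer
    import resource._

    // Single-column Imputer from mleap-spark-extension (setter names assumed)
    val imputer = new Imputer()
      .setInputCol("rating")            // placeholder input column
      .setOutputCol("rating_imputed")   // placeholder output column
      .setStrategy("mean")

    val pipeline = new Pipeline().setStages(Array(imputer))
    val model = pipeline.fit(trainingDf) // trainingDf: your training DataFrame

    // Export the fitted pipeline to an MLeap bundle; older MLeap versions may
    // accept a plain SparkBundleContext() without withDataset
    implicit val sbc = SparkBundleContext().withDataset(model.transform(trainingDf))
    for (bundle <- managed(BundleFile("jar:file:/tmp/imputer-pipeline.zip"))) {
      model.writeBundle.save(bundle).get
    }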

caesarjuly commented 6 years ago

@ancasarb Thank you very much, that should be the key. I'll try it later. I have another question: what's the relationship between mleap-spark and mleap-spark-extension? Which one should I use?

caesarjuly commented 6 years ago

By the way, after reading and using this project, I really want to participate in it. Is there any way to join? I feel there are many features waiting to be added.

gabtibe commented 6 years ago

Any developments on this issue?

ancasarb commented 6 years ago

@gabtibe please see the answer above about using the Imputer from mleap-spark-extension. Let me know if you have any questions!

gabtibe commented 6 years ago

@ancasarb Thanks for your response. I did use the Imputer from mleap-spark-extension, but I was wondering whether there is any plan to support the standard Imputer from Spark, since I noticed it speeds up computation and reduces the number of steps in the pipeline to be saved.

kevinykuo commented 6 years ago

+1 for support of Spark's ImputerModel; it makes exporting existing pipelines much easier.

benoua commented 5 years ago

I have the same issue with PySpark 2.3.0. From what I saw, mleap-spark-extension is not available in Python.

botchniaque commented 2 years ago

I am very confused about MLeap's support for the Spark Imputer.

Why doesn't the documentation mention that the Spark Imputer is only supported when using the class from mleap-spark-extension? The "Supported transformers" table lists Imputer as supported, without any explanation.

Do I understand correctly that it's not possible to use the standard Imputer when creating a pipeline in PySpark?

EDIT: I have also asked a Stack Overflow question about this: https://stackoverflow.com/questions/71209926/mleap-support-spark-ml-imputer