combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

java.lang.NoClassDefFoundError: scalapb/Message when serializing PySpark Model #675

Status: Closed (its-felix closed this issue 4 years ago)

its-felix commented 4 years ago

I get the following exception in my local PySpark pipeline when I try to serialize the model using MLeap:

Traceback (most recent call last):
  File ".\examples\src\main\python\ml\random_forest_classifier_example.py", line 88, in <module>
    model.serializeToBundle("jar:file:/Users/fwollsch/Downloads/test.zip", model.transform(trainingData))
  File "C:\Program Files\Python37\lib\site-packages\mleap\pyspark\spark_support.py", line 25, in serializeToBundle
    serializer.serializeToBundle(self, path, dataset=dataset)
  File "C:\Program Files\Python37\lib\site-packages\mleap\pyspark\spark_support.py", line 42, in serializeToBundle
    self._java_obj.serializeToBundle(transformer._to_java(), path, dataset._jdf)
  File "C:\Program Files\Python37\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Program Files\Python37\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Program Files\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o408.serializeToBundle.
: java.lang.NoClassDefFoundError: scalapb/Message
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at ml.combust.bundle.dsl.Value$.stringList(Value.scala:207)
        at org.apache.spark.ml.bundle.ops.feature.StringIndexerOp$$anon$1.store(StringIndexerOp.scala:20)
        at org.apache.spark.ml.bundle.ops.feature.StringIndexerOp$$anon$1.store(StringIndexerOp.scala:13)
        at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:87)
        at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:83)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
        at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:85)
        at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:81)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
        at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:34)
        at ml.combust.bundle.serializer.GraphSerializer$$anonfun$writeNode$1.apply(GraphSerializer.scala:30)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.GraphSerializer.writeNode(GraphSerializer.scala:30)
        at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
        at ml.combust.bundle.serializer.GraphSerializer$$anonfun$write$2.apply(GraphSerializer.scala:21)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
        at ml.combust.bundle.serializer.GraphSerializer.write(GraphSerializer.scala:20)
        at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:21)
        at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:14)
        at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:87)
        at ml.combust.bundle.serializer.ModelSerializer$$anonfun$write$1.apply(ModelSerializer.scala:83)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
        at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:85)
        at ml.combust.bundle.serializer.NodeSerializer$$anonfun$write$1.apply(NodeSerializer.scala:81)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
        at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:34)
        at ml.combust.bundle.serializer.BundleSerializer$$anonfun$write$1.apply(BundleSerializer.scala:29)
        at scala.util.Try$.apply(Try.scala:192)
        at ml.combust.bundle.serializer.BundleSerializer.write(BundleSerializer.scala:29)
        at ml.combust.bundle.BundleWriter.save(BundleWriter.scala:31)
        at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$serializeToBundleWithFormat$2.apply(SimpleSparkSerializer.scala:26)
        at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$serializeToBundleWithFormat$2.apply(SimpleSparkSerializer.scala:25)
        at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:88)
        at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
        at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
        at scala.util.control.Exception$Catch.apply(Exception.scala:103)
        at scala.util.control.Exception$Catch.either(Exception.scala:125)
        at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88)
        at resource.ManagedResourceOperations$class.apply(ManagedResourceOperations.scala:26)
        at resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50)
        at resource.DeferredExtractableManagedResource$$anonfun$tried$1.apply(AbstractManagedResource.scala:33)
        at scala.util.Try$.apply(Try.scala:192)
        at resource.DeferredExtractableManagedResource.tried(AbstractManagedResource.scala:33)
        at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:27)
        at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        ... 74 more
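
The `NoClassDefFoundError: scalapb/Message` means the `scalapb.Message` class is missing at runtime when `StringIndexerOp` serializes its model via `Value.stringList`; that class ships with the scalapb 0.7.x line that MLeap 0.15.0 appears to be built against, and was dropped in later scalapb releases. A hedged, stdlib-only way to check whether a given scalapb jar on the classpath actually contains the class (the jar path in the usage comment is illustrative):

```python
import zipfile

def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar (a plain zip archive) contains the given class.

    class_name uses the JVM internal form, e.g. "scalapb/Message".
    """
    entry = class_name + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Usage (path illustrative, point it at a jar in your Spark jars directory):
#   jar_contains_class("scalapb-runtime_2.11-0.10.0-M4.jar", "scalapb/Message")
```

If this returns False for every scalapb-runtime jar on the classpath, the error above is exactly what the JVM will throw the first time a generated message class is loaded.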

I'm using random_forest_classifier_example.py from the PySpark examples, with the addition of MLeap:

    [...]

    # serialize using mleap ( https://mleap-docs.combust.ml/py-spark/ )
    # Imports MLeap serialization functionality for PySpark
    import mleap.pyspark
    from mleap.pyspark.spark_support import SimpleSparkSerializer

    # SimpleSparkSerializer().serializeToBundle(model, "jar:file:/Users/fwollsch/Downloads/test.zip", dataset = trainingData)
    model.serializeToBundle("jar:file:/Users/fwollsch/Downloads/test.zip", model.transform(trainingData))

    spark.stop()

OS: Windows 10
MLeap: 0.15.0 (installed using pip)
PySpark: 2.4.5
Python: 3.7.2

I have added the missing jars to the jars directory of my PySpark installation. The following jars are currently in that directory:

activation-1.1.1.jar
aircompressor-0.10.jar
antlr-2.7.7.jar
antlr4-runtime-4.7.jar
antlr-runtime-3.4.jar
aopalliance-1.0.jar
aopalliance-repackaged-2.4.0-b34.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
apache-log4j-extras-1.2.17.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
arpack_combined_all-0.1.jar
arrow-format-0.10.0.jar
arrow-memory-0.10.0.jar
arrow-vector-0.10.0.jar
automaton-1.11-8.jar
avro-1.8.2.jar
avro-ipc-1.8.2.jar
avro-mapred-1.8.2-hadoop2.jar
bonecp-0.8.0.RELEASE.jar
breeze_2.11-0.13.2.jar
breeze-macros_2.11-0.13.2.jar
bundle-hdfs_2.11-0.15.0.jar
bundle-ml_2.11-0.15.0.jar
calcite-avatica-1.2.0-incubating.jar
calcite-core-1.2.0-incubating.jar
calcite-linq4j-1.2.0-incubating.jar
chill_2.11-0.9.3.jar
chill-java-0.9.3.jar
commons-beanutils-1.9.4.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.2.jar
commons-compiler-3.0.9.jar
commons-compress-1.8.1.jar
commons-configuration-1.6.jar
commons-crypto-1.0.0.jar
commons-dbcp-1.4.jar
commons-digester-1.8.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-lang3-3.5.jar
commons-logging-1.1.3.jar
commons-math3-3.4.1.jar
commons-net-3.1.jar
commons-pool-1.5.4.jar
compilerplugin_2.11-0.10.0-M4.jar
compilerplugin-shaded_2.11-0.10.0-M4.jar
compress-lzf-1.0.3.jar
config-1.4.0.jar
core-1.1.2.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
derby-10.12.1.1.jar
eigenbase-properties-1.1.5.jar
flatbuffers-1.2.0-3f79e055.jar
generex-1.0.2.jar
gson-2.2.4.jar
guava-14.0.1.jar
guice-3.0.jar
guice-servlet-3.0.jar
hadoop-annotations-2.7.3.jar
hadoop-auth-2.7.3.jar
hadoop-client-2.7.3.jar
hadoop-common-2.7.3.jar
hadoop-hdfs-2.7.3.jar
hadoop-mapreduce-client-app-2.7.3.jar
hadoop-mapreduce-client-common-2.7.3.jar
hadoop-mapreduce-client-core-2.7.3.jar
hadoop-mapreduce-client-jobclient-2.7.3.jar
hadoop-mapreduce-client-shuffle-2.7.3.jar
hadoop-yarn-api-2.7.3.jar
hadoop-yarn-client-2.7.3.jar
hadoop-yarn-common-2.7.3.jar
hadoop-yarn-server-common-2.7.3.jar
hadoop-yarn-server-web-proxy-2.7.3.jar
hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar
hk2-api-2.4.0-b34.jar
hk2-locator-2.4.0-b34.jar
hk2-utils-2.4.0-b34.jar
hppc-0.7.2.jar
htrace-core-3.1.0-incubating.jar
httpclient-4.5.6.jar
httpcore-4.4.10.jar
ivy-2.4.0.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.6.7.3.jar
jackson-dataformat-yaml-2.6.7.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-jaxb-annotations-2.6.7.jar
jackson-module-paranamer-2.7.9.jar
jackson-module-scala_2.11-2.6.7.1.jar
jackson-xc-1.9.13.jar
janino-3.0.9.jar
JavaEWAH-0.3.2.jar
javassist-3.18.1-GA.jar
javax.annotation-api-1.2.jar
javax.inject-1.jar
javax.inject-2.4.0-b34.jar
javax.servlet-api-3.1.0.jar
javax.ws.rs-api-2.0.1.jar
javolution-5.5.1.jar
jaxb-api-2.2.2.jar
jcl-over-slf4j-1.7.16.jar
jdo-api-3.0.1.jar
jersey-client-2.22.2.jar
jersey-common-2.22.2.jar
jersey-container-servlet-2.22.2.jar
jersey-container-servlet-core-2.22.2.jar
jersey-guava-2.22.2.jar
jersey-media-jaxb-2.22.2.jar
jersey-server-2.22.2.jar
jetty-6.1.26.jar
jetty-util-6.1.26.jar
jline-2.14.6.jar
joda-time-2.9.3.jar
jodd-core-3.5.2.jar
jpam-1.1.jar
json4s-ast_2.11-3.5.3.jar
json4s-core_2.11-3.5.3.jar
json4s-jackson_2.11-3.5.3.jar
json4s-scalap_2.11-3.5.3.jar
jsp-api-2.1.jar
jsr305-1.3.9.jar
jta-1.1.jar
jtransforms-2.4.0.jar
jul-to-slf4j-1.7.16.jar
kryo-shaded-4.0.2.jar
kubernetes-client-4.6.1.jar
kubernetes-model-4.6.1.jar
kubernetes-model-common-4.6.1.jar
lenses_2.11-0.10.0-M4.jar
leveldbjni-all-1.8.jar
libfb303-0.9.3.jar
libthrift-0.9.3.jar
log4j-1.2.17.jar
logging-interceptor-3.12.0.jar
lz4-java-1.4.0.jar
machinist_2.11-0.6.1.jar
macro-compat_2.11-1.1.1.jar
mesos-1.4.0-shaded-protobuf.jar
metrics-core-3.1.5.jar
metrics-graphite-3.1.5.jar
metrics-json-3.1.5.jar
metrics-jvm-3.1.5.jar
minlog-1.3.0.jar
mleap-base_2.11-0.15.0.jar
mleap-core_2.11-0.15.0.jar
mleap-executor_2.11-0.15.0.jar
mleap-runtime_2.11-0.15.0.jar
mleap-spark_2.11-0.15.0.jar
mleap-spark-base_2.11-0.15.0.jar
mleap-spark-extension_2.11-0.15.0.jar
mleap-tensor_2.11-0.15.0.jar
netty-3.9.9.Final.jar
netty-all-4.1.42.Final.jar
objenesis-2.5.1.jar
okhttp-3.12.0.jar
okio-1.15.0.jar
opencsv-2.3.jar
orc-core-1.5.5-nohive.jar
orc-mapreduce-1.5.5-nohive.jar
orc-shims-1.5.5.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.8.jar
parquet-column-1.10.1.jar
parquet-common-1.10.1.jar
parquet-encoding-1.10.1.jar
parquet-format-2.4.0.jar
parquet-hadoop-1.10.1.jar
parquet-hadoop-bundle-1.6.0.jar
parquet-jackson-1.10.1.jar
protobuf-java-2.5.0.jar
protobuf-runtime-scala_2.11-0.8.3.jar
protoc-bridge_2.11-0.7.14.jar
py4j-0.10.7.jar
pyrolite-4.13.jar
RoaringBitmap-0.7.45.jar
scala-arm_2.11-2.0.jar
scala-compiler-2.11.12.jar
scala-library-2.11.12.jar
scala-parser-combinators_2.11-1.1.0.jar
scalapbc_2.11-0.10.0-M4.jar
scalapb-json4s_2.11-0.10.1-M1.jar
scalapb-runtime_2.11-0.10.0-M4.jar
scalapb-runtime-grpc_2.11-0.10.0-M4.jar
scala-reflect-2.11.12.jar
scala-xml_2.11-1.0.5.jar
shapeless_2.11-2.3.2.jar
shims-0.7.45.jar
slf4j-api-1.7.16.jar
slf4j-log4j12-1.7.16.jar
snakeyaml-1.15.jar
snappy-0.2.jar
snappy-java-1.1.7.3.jar
spark-catalyst_2.11-2.4.5.jar
spark-core_2.11-2.4.5.jar
spark-graphx_2.11-2.4.5.jar
spark-hive_2.11-2.4.5.jar
spark-hive-thriftserver_2.11-2.4.5.jar
spark-kubernetes_2.11-2.4.5.jar
spark-kvstore_2.11-2.4.5.jar
spark-launcher_2.11-2.4.5.jar
spark-mesos_2.11-2.4.5.jar
spark-mllib_2.11-2.4.5.jar
spark-mllib-local_2.11-2.4.5.jar
spark-network-common_2.11-2.4.5.jar
spark-network-shuffle_2.11-2.4.5.jar
spark-repl_2.11-2.4.5.jar
spark-sketch_2.11-2.4.5.jar
spark-sql_2.11-2.4.5.jar
sparksql-scalapb_2.11-0.9.2.jar
spark-streaming_2.11-2.4.5.jar
spark-tags_2.11-2.4.5.jar
spark-tags_2.11-2.4.5-tests.jar
spark-unsafe_2.11-2.4.5.jar
spark-yarn_2.11-2.4.5.jar
spire_2.11-0.13.0.jar
spire-macros_2.11-0.13.0.jar
ST4-4.0.4.jar
stax-api-1.0.1.jar
stax-api-1.0-2.jar
stream-2.7.0.jar
stringtemplate-3.2.1.jar
super-csv-2.2.0.jar
univocity-parsers-2.7.3.jar
validation-api-1.1.0.Final.jar
xbean-asm6-shaded-4.8.jar
xercesImpl-2.9.1.jar
xmlenc-0.52.jar
xz-1.5.jar
zjsonpatch-0.3.0.jar
zookeeper-3.4.6.jar
zstd-jni-1.3.2-2.jar
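
Given a jars directory like the above, mixed release lines among related artifacts are easy to miss by eye. A rough sketch for flagging them (the "first name component" grouping heuristic is an assumption for illustration, not an MLeap tool):

```python
import re
from collections import defaultdict

# Matches "<artifact>[_scalaVersion]-<version>.jar", e.g.
# "scalapb-runtime_2.11-0.10.0-M4.jar" -> ("scalapb-runtime", "0.10.0-M4")
JAR_RE = re.compile(r"^(?P<artifact>.+?)(?:_2\.\d+)?-(?P<version>\d[\w.\-]*)\.jar$")

def version_conflicts(jar_names):
    """Group jars into coarse families by their leading name component and
    report families whose members disagree on version."""
    versions = defaultdict(set)
    for name in jar_names:
        m = JAR_RE.match(name)
        if m:
            family = m.group("artifact").split("-")[0]
            versions[family].add(m.group("version"))
    return {fam: sorted(v) for fam, v in versions.items() if len(v) > 1}

jars = [
    "scalapb-runtime_2.11-0.10.0-M4.jar",
    "scalapb-json4s_2.11-0.10.1-M1.jar",
    "mleap-runtime_2.11-0.15.0.jar",
    "mleap-spark_2.11-0.15.0.jar",
]
print(version_conflicts(jars))
# → {'scalapb': ['0.10.0-M4', '0.10.1-M1']}
```

Applied to the full list above, the scalapb family (0.10.0-M4 / 0.10.1-M1 / 0.8.3 across runtime, json4s, and protobuf-runtime-scala) stands out against MLeap's uniformly versioned 0.15.0 jars.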
ancasarb commented 4 years ago

@codeflush-dev This looks like it could be a scalapb versioning issue, perhaps?

    println(scalapb.compiler.Version.scalapbVersion)

Inside MLeap this seems to print 0.7.1; perhaps try that version instead of 0.10.1-M1/0.10.0-M4?

its-felix commented 4 years ago

> @codeflush-dev This looks like it could be a scalapb versioning issue, perhaps?
>
>     println(scalapb.compiler.Version.scalapbVersion)
>
> Inside MLeap this seems to print 0.7.1; perhaps try that version instead of 0.10.1-M1/0.10.0-M4?

I'll try that and come back to you. Thanks :)

its-felix commented 4 years ago

Works as expected now.
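
For reference, the resolution above amounts to replacing the 0.10.x scalapb jars with the 0.7.x line that MLeap 0.15.0 was compiled against. A sketch of building the conventional Maven Central download path for the matching runtime jar (the exact coordinates to pin are an assumption based on this thread, not something MLeap documents):

```python
def maven_central_url(group: str, artifact: str, version: str) -> str:
    """Build the conventional Maven Central (repo1) path for a jar:
    groupId dots become path segments, then artifact/version/artifact-version.jar."""
    return (
        "https://repo1.maven.org/maven2/"
        f"{group.replace('.', '/')}/{artifact}/{version}/"
        f"{artifact}-{version}.jar"
    )

# scalapb-runtime 0.7.1 for Scala 2.11, per the version suggested above.
print(maven_central_url("com.thesamet.scalapb", "scalapb-runtime_2.11", "0.7.1"))
# → https://repo1.maven.org/maven2/com/thesamet/scalapb/scalapb-runtime_2.11/0.7.1/scalapb-runtime_2.11-0.7.1.jar
```

Dropping that jar into the Spark jars directory in place of the 0.10.x runtime (and aligning the other scalapb artifacts to the same line) matches the outcome reported here.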