combust / mleap-docs

Documentation for MLeap

serializeToBundle object issue #8

Open drkmd8 opened 7 years ago

drkmd8 commented 7 years ago

Running the code as above, featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip") fails with AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'.

If I instead fit the pipeline first:

featurePipeline2 = featurePipeline.fit(df2)
featurePipeline2.serializeToBundle("jar:file:/tmp/pyspark.example.zip")

then the line self._java_obj = _jvm().ml.combust.mleap.spark.SimpleSparkSerializer() fails with TypeError: 'JavaPackage' object is not callable.

How can this be solved?
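Two different failures are mixed here. The AttributeError usually means the MLeap monkey-patch was never applied: serializeToBundle is only attached to Spark transformers after importing mleap.pyspark. The TypeError means the MLeap jars are missing from the Spark JVM's classpath, which the workarounds below address. A minimal sketch of the intended call pattern, reusing the featurePipeline and df2 names from the snippet above:

    # Importing mleap.pyspark patches serializeToBundle onto Spark transformers.
    import mleap.pyspark
    from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

    fitted = featurePipeline.fit(df2)  # serialize the fitted PipelineModel, not the Pipeline
    fitted.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                             dataset=fitted.transform(df2))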

hollinwilkins commented 6 years ago

@drkmd8 Can you give us version information for both the Python and MLeap JVM packages as well as Spark that you are using?

drkmd8 commented 6 years ago

I used Python 3.6.1, MLeap 0.8.1 (pip install mleap), and PySpark 2.1.1 + Hadoop 2.7. I think the problem is caused by the Python mleap library: Scala seems to work fine, but Python requires running with the external jar files that contain the MLeap classes.

alexkayal commented 6 years ago

I have the same issue at the moment.

tianhongjie commented 6 years ago

Yes, I have the same issue. My solution is to add the jar files to the PySpark jars directory inside the Python package path (site-packages/pyspark/jars/). I added the following jars:

mleap-base_2.11-0.10.0.jar
mleap-core_2.11-0.10.0.jar
mleap-runtime_2.11-0.10.0.jar
mleap-spark_2.11-0.10.0.jar
mleap-spark-base_2.11-0.10.0.jar
mleap-tensor_2.11-0.10.0.jar

I hope it is helpful for you.
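For anyone scripting this, the directory in question can be located programmatically instead of hard-coding the site-packages path; a small sketch, assuming a pip-installed PySpark:

    import os
    import pyspark

    # Jars dropped into this directory are picked up by the JVM that PySpark launches.
    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    print(jars_dir)  # e.g. .../site-packages/pyspark/jars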

alexkayal commented 6 years ago

I also solved it by adding the jars manually to /usr/lib/spark/jars. But I guess there is a better way: sudo pip install jip, then install MLeap. If I understand correctly, jip is supposed to take care of the Java dependencies.

Khiem-Tran commented 6 years ago

Hi @alexkayal and @tianhongjie, I tried your solution; it fixed the JavaPackage issue, but then I got another one.

Py4JJavaError: An error occurred while calling o261.serializeToBundle.
: java.lang.NoClassDefFoundError: com/trueaccord/scalapb/GeneratedEnum

I am not sure how this can happen, since my dataframe only has primitive types (int, double). Do you have any ideas?

samant2008 commented 6 years ago

@alexkayal @tianhongjie @Khiem-Tran, I am also getting the same error. Any idea how to resolve this issue?

Py4JJavaError: An error occurred while calling o414.serializeToBundle.
: java.lang.NoClassDefFoundError: com/trueaccord/scalapb/GeneratedEnum
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.trueaccord.scalapb.GeneratedEnum
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 23 more

elgalu commented 6 years ago

I fixed the com/trueaccord-related errors by adding lenses_2.11-0.4.12.jar.

samant2008 commented 6 years ago

@elgalu ,

Thank you so much for your prompt response. I have included lenses_2.11-0.4.12.jar, but I'm still getting the same error as above. Do you have any other suggestions to resolve this issue?

elgalu commented 6 years ago

Make sure it is on the CLASSPATH. Note that Py4J has its own jars/ folder, and if you install pyspark separately it also comes with its own jars/ folder. What I do is remove all of those jars directories and symlink them to a single /jars directory where I put together the whole set of working versions.

You can find all my working jars at: https://github.com/elgalu/jupyter-spark-117/tree/master/spark/jars

Pending: to build an sbt or pom.xml project (instead of a bunch of jars)
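To sanity-check which of those jars the driver JVM actually loaded, one option from the PySpark side is to inspect its classpath; a hedged sketch, assuming an active SparkSession named spark and a Unix-style path separator:

    # Ask the driver JVM for its classpath and keep only the MLeap entries.
    cp = spark.sparkContext._jvm.System.getProperty("java.class.path")
    print([p for p in cp.split(":") if "mleap" in p])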

siyouhe666 commented 5 years ago

@elgalu,

> Thank you so much for your prompt response. I have included lenses_2.11-0.4.12.jar, but I'm still getting the same error as above. Do you have any other suggestions to resolve this issue?

I have the same problem; have you found an answer?

Khiem-Tran commented 5 years ago

@elgalu @siyouhe666, I have been using Spark 2.2.1 with the config --packages ml.combust.mleap:mleap-spark_2.11:0.11.0. It seems to work for me.

Btw, I also followed this [blog](https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb) to fix the xgboost dependency, because somehow my mleap-xgboost does not work properly.

siyouhe666 commented 5 years ago

> @elgalu @siyouhe666, I have been using Spark 2.2.1 with the config --packages ml.combust.mleap:mleap-spark_2.11:0.11.0. It seems to work for me.
>
> Btw, I also followed this [blog](https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb) to fix the xgboost dependency, because somehow my mleap-xgboost does not work properly.

Thanks, I solved this problem by changing my Spark version to 2.4.0. Btw, although the official MLeap docs say 2.4.0 is not yet supported, I found it works well.

yairdata commented 5 years ago

I am using Python 3.6 and PySpark 2.3.1, with mleap-core_2.11-0.11.0.jar, mleap-spark-base_2.11-0.13.0.jar, and mleap-runtime_2.11-0.13.0.jar. I tried two approaches to make PySpark aware of MLeap: a git clone plus sys.path.append('C:\my-mleap\mleap-master\python') combined with os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars ....', and pip install mleap (which installs version 0.8.1).

When calling model.serializeToBundle(model_file_path, sparkTransformed) I get: Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM

SoloBean commented 5 years ago

@yairdata I have the same issue. Have you found an answer? Thanks a lot.

SoloBean commented 5 years ago

@yairdata I solved this problem by adjusting the MLeap version: originally I used 0.13.0, now I use 0.11.0, but that raises another problem:

Py4JJavaError: An error occurred while calling o126.serializeToBundle.
: java.lang.NoClassDefFoundError: com/typesafe/config/ConfigFactory
    at org.apache.spark.ml.bundle.SparkBundleContext$.apply(SparkBundleContext.scala:37)
    at org.apache.spark.ml.bundle.SparkBundleContext$.defaultContext$lzycompute(SparkBundleContext.scala:31)
    at org.apache.spark.ml.bundle.SparkBundleContext$.defaultContext(SparkBundleContext.scala:31)
    at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$1.apply(SimpleSparkSerializer.scala:22)
    at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$1.apply(SimpleSparkSerializer.scala:22)
    at scala.Option.map(Option.scala:146)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:22)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.typesafe.config.ConfigFactory

yairdata commented 5 years ago

@SoloBean - I solved this problem with 0.13.0 by setting spark.jars.packages to ml.combust.mleap:mleap-spark-base_2.11:0.13.0,ml.combust.mleap:mleap-spark_2.11:0.13.0. Now I have another issue with missing jars, but that happens because I am behind a firewall; without the firewall everything works more or less as expected (I was able to export the model, though to a directory rather than to a jar file as mentioned in the documentation).

SoloBean commented 5 years ago

@yairdata - I also solved that problem by adding jars to /jars, but after adding every jar I know of, another problem comes up that I don't know how to solve by adding a jar:

Py4JJavaError: An error occurred while calling o126.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: resource.package$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 13 more

yairdata commented 5 years ago

@SoloBean - I think there is an open issue about a dependency conflict for that, but I am not sure.

yairdata commented 5 years ago

Weird jar dependency issue: I have com.trueaccord.scalapb:scalapb-runtime_2.11:0.6.7 in spark.jars.packages, and I can see that it contains the GeneratedEnum class, but I still get the error below. I also tried putting the jar in the pyspark jars directory, and I have put the lenses jar mentioned above on the classpath. Is there any other jar dependency that is hidden and not referenced in the Maven dependency tree? The error:

Py4JJavaError: An error occurred while calling o438.serializeToBundle.
: java.lang.NoClassDefFoundError: scalapb/GeneratedEnum
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scalapb.GeneratedEnum
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 24 more

nikhilshekhar commented 5 years ago

It took some time to figure out, so I am putting the steps to resolve it below.

1) The issue as reported by @drkmd8 appears when the Java class in question cannot be accessed. This happens when the relevant jars are not all provided on the classpath, and can be resolved by passing the jars via the --jars argument or placing them on the classpath.

2) Once the above is resolved, one can still hit the issue pointed out by @yairdata. This happens because the JVM is unable to initialise the class: the location being looked into to instantiate the class is messed up.

The most straightforward way to circumvent both issues is to invoke pyspark via: pyspark --packages ml.combust.mleap:mleap-spark_2.11:0.11.0. The MLeap version can be chosen according to the compatibility matrix: https://github.com/combust/mleap#mleapspark-version. If the package download fails on a particular jar, that jar can be downloaded manually and placed in the corresponding .m2 directory, and the command re-run. All should be good then (see the sketch below for the equivalent in-code configuration).
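The same fix can be applied from inside a script rather than the pyspark launcher; a sketch, assuming no SparkSession is running yet (spark.jars.packages must be set before the driver JVM starts) and with the app name as a placeholder:

    from pyspark.sql import SparkSession

    # spark.jars.packages resolves the MLeap jars and their transitive
    # dependencies from Maven Central when the driver JVM starts.
    spark = (SparkSession.builder
             .appName("mleap-export")
             .config("spark.jars.packages",
                     "ml.combust.mleap:mleap-spark_2.11:0.11.0")
             .getOrCreate())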

This does not seem to be an MLeap issue as such and can be closed by the admins. But I do wonder why newer versions of mleap are not published to PyPI.

xiangninglyu commented 5 years ago

@SoloBean I am hitting the same resource/package error as you; did you find a solution? I got:

py4j.protocol.Py4JJavaError: An error occurred while calling o103.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: resource.package$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 13 more

sealzjh commented 5 years ago

I have the same issue after adding the jars:

    File "/Users/alan/local/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1598, in __getattr__
    py4j.protocol.Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM

My versions: Python 2.7.10, pyspark-2.4.0, spark-2.4.0-bin-hadoop2.7.

Jars:
mleap-base_2.11-0.13.0.jar
mleap-core_2.11-0.1.5.jar
mleap-executor_2.11-0.13.0.jar
mleap-runtime_2.11-0.13.0.jar
mleap-spark-base_2.11-0.13.0.jar
mleap-spark-testkit_2.11-0.13.0.jar
mleap-spark_2.11-0.13.0.jar
mleap-tensor_2.11-0.13.0.jar

itsmesrds commented 5 years ago

Hello @hollinwilkins.

I have the same issue: I added the jars and am still facing the same error.

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM

Python 3, pyspark-2.4.0, spark-2.4.0-bin-hadoop2.7.

Jars:
mleap-base_2.11-0.13.0.jar
mleap-core_2.11-0.1.5.jar
mleap-executor_2.11-0.13.0.jar
mleap-runtime_2.11-0.13.0.jar
mleap-spark-base_2.11-0.13.0.jar
mleap-spark-testkit_2.11-0.13.0.jar
mleap-spark_2.11-0.13.0.jar
mleap-tensor_2.11-0.13.0.jar

Please help me out with this.

Thanks

yairdata commented 5 years ago

It works with mleap version 0.13.0. I verified that by using all of the following jars when submitting the command:

spark-submit --master yarn --jars {jar list (each jar must have the full path!)} my_python.py

com.github.rwl#jtransforms;2.4.0 from central in [default]
com.google.protobuf#protobuf-java;3.5.1 from central in [default]
com.jsuereth#scala-arm_2.11;2.0 from central in [default]
com.lihaoyi#fastparse-utils_2.11;1.0.0 from central in [default]
com.lihaoyi#fastparse_2.11;1.0.0 from central in [default]
com.lihaoyi#sourcecode_2.11;0.1.4 from central in [default]
com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from central in [default]
com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from central in [default]
com.typesafe#config;1.3.0 from central in [default]
io.spray#spray-json_2.11;1.3.2 from central in [default]
ml.combust.bundle#bundle-hdfs_2.11;0.13.0 from central in [default]
ml.combust.bundle#bundle-ml_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-base_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-core_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-runtime_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-spark-base_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-spark_2.11;0.13.0 from central in [default]
ml.combust.mleap#mleap-tensor_2.11;0.13.0 from central in [default]
org.scala-lang#scala-reflect;2.11.8 from central in [default]

y-tee commented 4 years ago

Hi @yairdata, how did you manage to find out which versions of the jar files are compatible? Has anyone used mleap 0.15.0 yet?

yairdata commented 4 years ago

@y-tee - A lot of trial and error... I wish it were documented somewhere; since it wasn't, I pasted it here to help others.

y-tee commented 4 years ago

@yairdata Did you try all the versions :scream: Then I should probably downgrade my mleap to 0.13.0; it works if I just change the GitHub version.py to 0.13.0 instead of the default (0.15.0), since pip will give you a super old version.

yairdata commented 4 years ago

@y-tee Not all versions; there are compatible jar versions, but not all of them are listed as dependencies, so it is trial and error. Regarding newer mleap versions: I didn't try them because I am using an older Spark version (2.3.1), which is compatible with mleap v0.13.0.

ancasarb commented 4 years ago

I've released the Python mleap version 0.15.0 just today, fyi: https://pypi.org/project/mleap/#history. Please let me know if you see any issues.

RuxuePeng commented 4 years ago

My mleap is 0.15.0 and Spark is 2.4.4, and I'm having this issue again.
Code:

pipeline = pipeline.fit(feature_df)
predictions = pipeline.transform(feature_df)
model_local_path = "something"
model_path = "jar:file:" + model_local_path + "/model.zip"
pipeline.serializeToBundle(model_path, predictions)

Errors:

Encountered error: 'PipelineModel' object has no attribute 'serializeToBundle'

Encountered error: An error occurred while calling o1480.serializeToBundle.
: java.lang.ExceptionInInitializerError
    at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$1.apply(SimpleSparkSerializer.scala:22)
    at ml.combust.mleap.spark.SimpleSparkSerializer$$anonfun$1.apply(SimpleSparkSerializer.scala:22)
    at scala.Option.map(Option.scala:146)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:22)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: unsupported Spark version: 2.4.4
    at org.apache.spark.ml.bundle.SparkBundleContext$.<init>(SparkBundleContext.scala:27)
    at org.apache.spark.ml.bundle.SparkBundleContext$.<clinit>(SparkBundleContext.scala)

felixgao commented 4 years ago

I also have a problem with mleap 0.15.0 and Spark 2.4.4, basically running the code in https://github.com/combust/mleap-demo/blob/master/notebooks/PySpark%20-%20AirBnb.ipynb

pyspark
[I 17:34:36.883 NotebookApp] Loading IPython parallel extension
...
[W 17:34:53.685 NotebookApp] 404 GET /nbextensions/nbextensions_configurator/config_menu/main.js?v=20200212173436 (::1) 7.00ms referer=http://localhost:8888/notebooks/MLeap.ipynb
[I 17:34:54.021 NotebookApp] Kernel started: 56fef487-0dea-47ba-8ad3-8c19241c1193
[W 17:34:54.175 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20200212173436 (::1) 2.67ms referer=http://localhost:8888/notebooks/MLeap.ipynb
Ivy Default Cache set to: /Users/ggao/.ivy2/cache
The jars for the packages stored in: /Users/ggao/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/Cellar/apache-spark/2.4.4/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-avro_2.11 added as a dependency
ml.combust.mleap#mleap-spark_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dbeefc3f-8e12-443d-8629-8adf19670d42;1.0
    confs: [default]
    found org.apache.spark#spark-avro_2.11;2.4.4 in central
    found org.spark-project.spark#unused;1.0.0 in local-m2-cache
    found ml.combust.mleap#mleap-spark_2.11;0.15.0 in central
    found ml.combust.mleap#mleap-spark-base_2.11;0.15.0 in central
    found ml.combust.mleap#mleap-runtime_2.11;0.15.0 in central
    found ml.combust.mleap#mleap-core_2.11;0.15.0 in central
    found ml.combust.mleap#mleap-base_2.11;0.15.0 in central
    found ml.combust.mleap#mleap-tensor_2.11;0.15.0 in central
    found io.spray#spray-json_2.11;1.3.2 in central
    found com.github.rwl#jtransforms;2.4.0 in central
    found ml.combust.bundle#bundle-ml_2.11;0.15.0 in central
    found com.google.protobuf#protobuf-java;3.5.1 in central
    found com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 in local-m2-cache
    found com.thesamet.scalapb#lenses_2.11;0.7.0-test2 in local-m2-cache
    found com.lihaoyi#fastparse_2.11;1.0.0 in local-m2-cache
    found com.lihaoyi#fastparse-utils_2.11;1.0.0 in local-m2-cache
    found com.lihaoyi#sourcecode_2.11;0.1.4 in local-m2-cache
    found com.jsuereth#scala-arm_2.11;2.0 in central
    found com.typesafe#config;1.3.0 in local-m2-cache
    found commons-io#commons-io;2.5 in local-m2-cache
    found org.scala-lang#scala-reflect;2.11.8 in local-m2-cache
    found ml.combust.bundle#bundle-hdfs_2.11;0.15.0 in central
:: resolution report :: resolve 547ms :: artifacts dl 16ms
    :: modules in use:
    com.github.rwl#jtransforms;2.4.0 from central in [default]
    com.google.protobuf#protobuf-java;3.5.1 from central in [default]
    com.jsuereth#scala-arm_2.11;2.0 from central in [default]
    com.lihaoyi#fastparse-utils_2.11;1.0.0 from local-m2-cache in [default]
    com.lihaoyi#fastparse_2.11;1.0.0 from local-m2-cache in [default]
    com.lihaoyi#sourcecode_2.11;0.1.4 from local-m2-cache in [default]
    com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from local-m2-cache in [default]
    com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from local-m2-cache in [default]
    com.typesafe#config;1.3.0 from local-m2-cache in [default]
    commons-io#commons-io;2.5 from local-m2-cache in [default]
    io.spray#spray-json_2.11;1.3.2 from central in [default]
    ml.combust.bundle#bundle-hdfs_2.11;0.15.0 from central in [default]
    ml.combust.bundle#bundle-ml_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-base_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-core_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-runtime_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-spark-base_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-spark_2.11;0.15.0 from central in [default]
    ml.combust.mleap#mleap-tensor_2.11;0.15.0 from central in [default]
    org.apache.spark#spark-avro_2.11;2.4.4 from central in [default]
    org.scala-lang#scala-reflect;2.11.8 from local-m2-cache in [default]
    org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
    :: evicted modules:
    com.google.protobuf#protobuf-java;3.5.0 by [com.google.protobuf#protobuf-java;3.5.1] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   23  |   0   |   0   |   1   ||   22  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-dbeefc3f-8e12-443d-8629-8adf19670d42
    confs: [default]
    0 artifacts copied, 22 already retrieved (0kB/15ms)
20/02/12 17:34:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[I 17:34:59.073 NotebookApp] Adapting from protocol version 5.1 (kernel 56fef487-0dea-47ba-8ad3-8c19241c1193) to 5.3 (client).
20/02/12 17:35:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/02/12 17:36:22 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
20/02/12 17:36:23 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
20/02/12 17:36:23 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Exception in thread "Thread-4" java.lang.NoClassDefFoundError: ml/combust/bundle/serializer/SerializationFormat
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
    at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
    at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
    at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
    at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: ml.combust.bundle.serializer.SerializationFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more

The error from the notebook:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.4.4/libexec/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.4.4/libexec/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/Cellar/apache-spark/2.4.4/libexec/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-18-e6e5bbbb80b2> in <module>()
----> 1 sparkPipelineLr.serializeToBundle(f"jar:file:{root_dir}/out/pyspark.lr.zip", sparkPipelineLr.transform(dataset_imputed))
      2 sparkPipelineLogr.serializeToBundle(f"jar:file:{root_dir}/out/pyspark.logr.zip", dataset=sparkPipelineLogr.transform(dataset_imputed))

/usr/local/lib/python3.7/site-packages/mleap/pyspark/spark_support.py in serializeToBundle(self, path, dataset)
     22 
     23 def serializeToBundle(self, path, dataset=None):
---> 24     serializer = SimpleSparkSerializer()
     25     serializer.serializeToBundle(self, path, dataset=dataset)
     26 

/usr/local/lib/python3.7/site-packages/mleap/pyspark/spark_support.py in __init__(self)
     37     def __init__(self):
     38         super(SimpleSparkSerializer, self).__init__()
---> 39         self._java_obj = _jvm().ml.combust.mleap.spark.SimpleSparkSerializer()
     40 
     41     def serializeToBundle(self, transformer, path, dataset):

/usr/local/Cellar/apache-spark/2.4.4/libexec/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __getattr__(self, name)
   1596                 answer[proto.CLASS_FQN_START:], self._gateway_client)
   1597         else:
-> 1598             raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
   1599 
   1600 

Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM

MauricioLins commented 4 years ago

@felixgao Have you fixed this problem? I am using the same versions and facing the same issue.

peterfig commented 4 years ago

I agree with others that this is a tricky dependency problem, not a problem with MLeap per se. Here is how I solved it on my MacBook:

spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.16.0 my_program.py

My PySpark version is 2.4.5 (see the MLeap Github page for what version of MLeap works with what version of Spark).

When I first ran spark-submit, I got a further error that Spark could not download some additional dependencies: these can be installed with Maven.

First, brew install maven from the command line.

Then, use maven from the command line to download dependencies. Here are the three I needed:

mvn org.apache.maven.plugins:maven-dependency-plugin:3.1.2:get -Dartifact=org.scala-lang:scala-reflect:2.11.12

mvn org.apache.maven.plugins:maven-dependency-plugin:3.1.2:get -Dartifact=com.google.protobuf:protobuf-java:3.5.1

mvn org.apache.maven.plugins:maven-dependency-plugin:3.1.2:get -Dartifact=com.typesafe:config:1.3.0

If you need different jars, you can find the coordinates by searching mvnrepository.com in your browser.

prasadpande1990 commented 4 years ago

Hi,

I am trying to build an AWS SageMaker model which includes a Spark pipeline model for feature transformation.

When I use MLeap inside my Docker container to serialize the PipelineModel, I get a similar exception.

I am not sure how to get all these MLeap jars into my Docker container.

Can anyone help me to get around this?

gs-alt commented 3 years ago

Same issue here, running pyspark 2.4.3 and mleap 0.17.0. I tried two things: adding all the jar files manually to the jars folder in pyspark, and running with spark-submit:

spark-submit --packages ml.combust.mleap:mleap-spark_2.12:0.17.0 main.py

Neither method worked.
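One thing worth double-checking in a setup like this: the Scala suffix of the MLeap artifact (_2.11 vs _2.12) has to match the Scala build of Spark, and the prebuilt Spark 2.4.x distributions are typically compiled against Scala 2.11, so a _2.12 artifact would fail to load there. A quick hedged check from PySpark, assuming an active session named spark:

    # Prints e.g. "version 2.11.12"; choose the MLeap artifact suffix to match.
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())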

z7ye commented 2 years ago

Got the same issue.

When running the code from the tutorial, fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fittedPipeline.transform(df2)) raises the following error:

> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> /tmp/ipykernel_5527/4288136627.py in <module>
> ----> 1 fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fittedPipeline.transform(df2))
> 
> ~/conda/pyspark30_p37_cpu_v2/lib/python3.7/site-packages/mleap/pyspark/spark_support.py in serializeToBundle(self, path, dataset)
>      22 
>      23 def serializeToBundle(self, path, dataset=None):
> ---> 24     serializer = SimpleSparkSerializer()
>      25     serializer.serializeToBundle(self, path, dataset=dataset)
>      26 
> 
> ~/conda/pyspark30_p37_cpu_v2/lib/python3.7/site-packages/mleap/pyspark/spark_support.py in __init__(self)
>      37     def __init__(self):
>      38         super(SimpleSparkSerializer, self).__init__()
> ---> 39         self._java_obj = _jvm().ml.combust.mleap.spark.SimpleSparkSerializer()
>      40 
>      41     def serializeToBundle(self, transformer, path, dataset):
> 
> TypeError: 'JavaPackage' object is not callable

Any suggestions, please?

Also, I tried to install mleap from source following the instructions, but got this error:

[error] (mleap-core/compile:compileIncremental) javac returned nonzero exit code
[error] Total time: 117 s, completed Nov 14, 2021 11:36:52 PM

drei34 commented 1 year ago

If you are using mleap 0.21.1, should serializeToBundle work? I am getting the error below. Is the only option to downgrade? PySpark is 3.1.3. This is after resolving several other issues.

Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM

I create a Spark session like this:

from pyspark.sql import SparkSession

def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0"
    ).enableHiveSupport().getOrCreate()

spark = gen_spark_session()

UPDATE: I was on Java 8, and apparently 0.21.1 is no good there; it needs Java 11. I moved to 0.20.0, but I still get this issue. I'm on Scala 2.12.
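Given the Java 8 vs. Java 11 point above, a quick way to confirm which Java the driver JVM is actually running on (a hedged sketch; spark is the session from gen_spark_session above):

    # The comment above suggests mleap 0.21.x needs Java 11; verify the runtime.
    print(spark.sparkContext._jvm.System.getProperty("java.version"))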