I had the same issue on #172, but it seems that if you change your last line to this:
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", featurePipeline.transform(df2))
it should work. It seems the documentation has not been updated to reflect this.
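A minimal, self-contained sketch of that suggested call, assuming a fitted pipeline (the tiny DataFrame and column names here are illustrative, not from the original issue):

import mleap.pyspark  # must run before serializeToBundle is used
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([("a",), ("b",)], ["name"])

# serializeToBundle is called on the *fitted* pipeline (a PipelineModel)
# and takes the target URI plus a transformed DataFrame.
featurePipeline = Pipeline(stages=[StringIndexer(inputCol="name", outputCol="name_idx")]).fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                  featurePipeline.transform(df2))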
@dvaldivia @priyeshkap thanks for raising this issue, once https://github.com/combust/mleap-docs/pull/11 has been merged, the documentation should be up to date.
@dvaldivia @ancasarb Merged and published
@dvaldivia Hey, I have the same error as @priyeshkap. But it does not look like the way serializeToBundle is called is wrong; rather, the featurePipeline object, which is a Pipeline, does not have any function called serializeToBundle, so it can't be called from there. I've tried both syntaxes; neither works.
Any other suggestions?
Since we are still referencing the pyspark object, I would guess that mleap needs to alter part of the code to attach this serializeToBundle function, but that does not seem to be happening here.
I've tried both the approach detailed in the documentation and the one detailed by @dvaldivia, and I'm still getting the AttributeError: 'Pipeline' object has no attribute 'serializeToBundle' error as well. Screenshot attached.
@nathanaelmouterde @robperc make sure you are importing mleap before pyspark:
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
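One quick way to check that the patch was applied (the hasattr probe below is our own diagnostic, not something from the MLeap docs):

import mleap.pyspark  # these two imports patch serializeToBundle onto PySpark's ML classes
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import PipelineModel

# Should print True once mleap has patched PySpark; if it prints False,
# the imports above did not run before the pipeline code.
print(hasattr(PipelineModel, 'serializeToBundle'))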
@dvaldivia Yes, those are already the first two lines of my script.
@nathanaelmouterde What version of Spark are you running? I encountered this issue (https://github.com/combust/mleap/issues/363) ... I downgraded from v2.3 to v2.2 and the issue you describe was no longer a problem. It seems MLeap isn't ready for Spark v2.3 yet. :) Update: as of MLeap v0.10.0, Spark 2.3 is supported.
I encountered a similar problem. The error message is as follows:
py4j.protocol.Py4JJavaError: An error occurred while calling o94.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: resource.package$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
When I run my script, an exception is raised at the statement:
fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fitted_pipeline.transform(df2))
Here is my script.
import mleap.pyspark  # imported first so that serializeToBundle is patched onto the fitted pipeline
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row, SparkSession
from pyspark import SparkContext
from pprint import pprint

spark = SparkSession \
    .builder \
    .getOrCreate()
sc = spark.sparkContext

# A tiny two-column DataFrame to drive the pipeline.
l = [('Alice', 10), ('Bob', 12), ('Alice', 13)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)

# Index the string column, then assemble the index into a feature vector.
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
pprint(string_indexer.getOutputCol())
feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()],
                                    outputCol='features')
feature_pipeline = Pipeline(stages=[string_indexer, feature_assembler])
fitted_pipeline = feature_pipeline.fit(df2)

# The exception is raised here.
fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                  fitted_pipeline.transform(df2))
@cappaberra thanks for your comment. I'm not getting this error anymore with MLeap v0.10.0. @nathanaelmouterde and I are using Spark v2.3.
@lie-yan I am getting the same error; were you able to find a resolution for it? I am running MLeap v0.12.0 with Spark 2.3.1.
File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1539093246812_0349/container_1539093246812_0349_01_000001/mleap.zip/mleap/pyspark/spark_support.py", line 25, in serializeToBundle
File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1539093246812_0349/container_1539093246812_0349_01_000001/mleap.zip/mleap/pyspark/spark_support.py", line 42, in serializeToBundle
File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1539093246812_0349/container_1539093246812_0349_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1539093246812_0349/container_1539093246812_0349_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1539093246812_0349/container_1539093246812_0349_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o121.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: resource.package$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 13 more
I tried the script you posted above as well; it fails with the same error.
*UPDATE: the reason for the error below was a missing jar in the spark.jars.packages configuration: ml.combust.mleap:mleap-spark_2.11:0.13.0. BUT the model can be exported only with model.serializeToBundle("file:/mnt/mleap-example", sparkTransformed) and NOT with model.serializeToBundle("jar:file:/tmp/20news_pipeline-json.zip", sparkTransformed), which should work according to the example. I wonder why...
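For reference, the two URI forms that update contrasts, side by side (model and sparkTransformed here are the fitted PipelineModel and transformed DataFrame from the comment above):

# Directory bundle - reported to work:
model.serializeToBundle("file:/mnt/mleap-example", sparkTransformed)

# Zip bundle - the documented form that failed for this commenter:
model.serializeToBundle("jar:file:/tmp/20news_pipeline-json.zip", sparkTransformed)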
I get a similar problem using MLeap 0.13.0 and PySpark 2.4.0. How can I point to the configuration file? There is no evidence in the docs that PySpark needs to be supplied with an MLeap configuration file. BTW, I also tried PySpark 2.3.0 with MLeap 0.13.0, as specified in the docs, but the same error occurs.
import os

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, FeatureHasher, StandardScaler, VectorAssembler, OneHotEncoderEstimator
from pyspark.ml import Transformer, Estimator
from pyspark.sql.functions import when

spark = SparkSession.builder.appName('GBT') \
    .config('spark.jars.packages',
            "ml.combust.mleap:mleap-spark-base_2.11:0.13.0,ml.combust.mleap:mleap-runtime_2.11:0.13.0") \
    .getOrCreate()

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

test_df = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Index and one-hot encode each categorical column.
categoricalCols = ["id", "text"]
stages = []
for cat_col in categoricalCols:
    stringIndexer = StringIndexer(inputCol=cat_col, outputCol=cat_col + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[cat_col + "classVec"])
    stages += [stringIndexer, encoder]

HashedInputs = [c + "classVec" for c in categoricalCols]
assembler = VectorAssembler(inputCols=HashedInputs, outputCol="features")
stages += [assembler]

gbt = GBTClassifier(maxBins=4, maxDepth=4, maxIter=5)
stages += [gbt]

pipeline = Pipeline(stages=stages)
model = pipeline.fit(training)
sparkTransformed = model.transform(training)

model_name_export = "gbt_pipeline.zip"
model_name_path = os.getcwd()
model_file = os.path.join(model_name_path, model_name_export)
model_file_path = "jar:file:{}".format(model_file)
model.serializeToBundle(model_file_path, sparkTransformed)
giving:
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 19 | 18 | 18 | 1 || 18 | 18 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-4f79d946-c242-4400-8619-8055519379e3
confs: [default]
18 artifacts copied, 0 already retrieved (16182kB/35ms)
2019-01-24 14:12:06 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "test_mlflow_mleap.py", line 67, in
@RaghavendraSingh I am failing with the same error; have you solved it?
@SoloBean, @RaghavendraSingh, @lie-yan - have you been able to overcome this error?
java.lang.NoClassDefFoundError: resource/package$
I am using PySpark 2.3.1 with MLeap 0.8.1 installed from PyPI. If the jar-with-dependencies build helped you, can you share its location or the steps taken to produce it from the project?
I think this issue is related to https://github.com/combust/mleap/issues/257, about automatic resource configuration. There are two related open issues: https://github.com/combust/mleap-docs/issues/8 and https://github.com/combust/mleap/issues/343.
: java.lang.NoClassDefFoundError: resource/package$ at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
I am using MLeap 0.8.1; it pulls in scala-arm_2.11;2.0 and Java 1.8.0_192, but it seems that something is still wrong with the dependencies (resource is the scala-arm package, so the error suggests that jar is not actually on the classpath at runtime). @hollinwilkins - please advise.
It turns out that it was a dependency problem. I had to go to the Maven repo, hunt down all the compile-time dependencies, and include them in my jars folder.
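An alternative to hunting down jars by hand is to let Spark's own Ivy resolution fetch MLeap and its transitive dependencies. A sketch (the coordinate below is illustrative; the Scala suffix and version must match your Spark build):

from pyspark.sql import SparkSession

# spark.jars.packages takes Maven coordinates; Spark resolves the
# transitive dependency graph when the session starts.
spark = (SparkSession.builder
         .appName("mleap-export")
         .config("spark.jars.packages",
                 "ml.combust.mleap:mleap-spark_2.11:0.13.0")
         .getOrCreate())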
I am also experiencing this issue:
py4j.protocol.Py4JJavaError: An error occurred while calling o119.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
...
I believe that I have all the required jars. I had to download these (a sketch of wiring them into a session follows the list):
mleap-base_2.11-0.13.0.jar
mleap-core_2.11-0.13.0.jar
mleap-runtime_2.11-0.13.0.jar
mleap-spark_2.11-0.13.0.jar
mleap-spark-base_2.11-0.13.0.jar
mleap-tensor_2.11-0.13.0.jar
bundle-hdfs_2.11-0.13.0.jar
bundle-ml_2.11-0.13.0.jar
scalapb-runtime_2.11-0.9.0-RC1.jar
lenses_2.11-0.9.0-RC1.jar
config-1.3.4.jar
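One way to point a session at hand-downloaded jars like these is the spark.jars setting, which takes a comma-separated list of local jar paths. A sketch, assuming the jars above sit in a local directory of your choosing:

import os
from pyspark.sql import SparkSession

jar_dir = "/opt/mleap-jars"  # assumed location of the jars listed above
jars = ",".join(os.path.join(jar_dir, j)
                for j in os.listdir(jar_dir) if j.endswith(".jar"))

spark = (SparkSession.builder
         .config("spark.jars", jars)
         .getOrCreate())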
I added all the dependencies mentioned by @enpinzolas and tried Spark versions 2.3.0, 2.3.3, and 2.4.3. I am getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o97.serializeToBundle.
: java.lang.NoClassDefFoundError: resource.package$
at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
Got it working with Spark 2.3 and 2.4. I added the following dependencies, besides the ones listed by @enpinzolas: scala-arm_2.11-2.0, spray-json_2.11-1.3.5, protobuf-java-3.8.0.
For those of you using amazon emr pyspark: https://github.com/combust/mleap-docs/issues/23
Closing this issue as an effort to clean up some older issues, please re-open if there are still unanswered questions, thank you!
I am experiencing errors while trying to set up MLeap, similar to those reported in https://github.com/combust/mleap/issues/172, which is now marked as closed.
I am trying to run the simple spark example: http://mleap-docs.combust.ml/py-spark/ using an AWS EMR cluster. After logging into the master node I run this shell script to install the necessary packages:
Then I run the following code from the simple tutorial:
However I get the following error:
This error has been raised in other issues, and a common solution is to check that the mleap import statements are executed first. I have ensured that this is the case, but I am still unable to run this code. I would be grateful for any advice on how to resolve it.